How to Create a RAG-based AI Chatbot

RAG-based AI chatbots aren't really that difficult to create. In this article I will explain how we created ours and walk through its core architecture, allowing you to understand its internals and maybe create one yourself. Rather than a full implementation, I will focus on the flow and architecture, adding only small illustrative sketches along the way, to make sure this tutorial works regardless of which programming language you use.
Notice, our AI platform is 100% open source, and you can clone it here. If you do, you can easily follow along in the code and reproduce what I'm doing in this tutorial.
VSS
The most important part of a RAG-based AI chatbot is VSS. VSS stands for "Vector Similarity Search" and allows us to find relevant information based upon natural language. The way it works is that when we save RAG data, we invoke OpenAI's embeddings API. The embeddings API takes text input and transforms it into a normalised vector. These vectors typically have either 1,536 or 3,072 dimensions, meaning each embedding is simply a list of 1,536 (or 3,072) floating point values. When OpenAI returns the embedding for an individual RAG record, we store it as part of the record, allowing us to later match prompts to RAG data.
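To make this concrete, here is a minimal sketch of creating and storing an embedding, assuming Python and OpenAI's official client library. The function name and the "text-embedding-3-small" model are illustrative choices, not necessarily what our platform uses internally.

```python
# Minimal sketch: embedding a RAG record with OpenAI's Python library.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_rag_record(text: str) -> list[float]:
    """Turn one RAG training snippet into a normalised embedding vector."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # returns 1,536-dimensional vectors
        input=text,
    )
    return response.data[0].embedding

# Store the embedding together with the record, so we can match prompts later.
snippet = "Our chatbot uses Vector Similarity Search to find relevant context."
record = {"content": snippet, "embedding": embed_rag_record(snippet)}
```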
Later, when users prompt the chatbot, we invoke the embeddings API on their prompt and perform a "similarity search" through the database of RAG training data, to find the records that are most similar to the specified query. This allows us to calculate the "distance" between each individual RAG record and the prompt, and order records such that those with the smallest distance are prioritised and passed into OpenAI's chat endpoint together with the prompt to answer the user's question. Below is a flowchart diagram illustrating the flow.
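In code, the similarity search can be as simple as the sketch below. This is a brute-force version for illustration; a production system would typically use a vector index instead. Since OpenAI's embeddings are normalised, one minus the dot product gives the cosine distance.

```python
# Minimal sketch: brute-force similarity search over stored RAG records.
import numpy as np

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance between two normalised embeddings; smaller means more similar."""
    return 1.0 - float(np.dot(a, b))

def find_closest_records(prompt_embedding: list[float], records: list[dict], top_n: int = 5):
    """Order RAG records by distance to the prompt, closest first."""
    scored = [(cosine_distance(prompt_embedding, r["embedding"]), r) for r in records]
    scored.sort(key=lambda pair: pair[0])
    return scored[:top_n]
```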
As we submit the RAG context to OpenAI, we make sure we've previously prompt engineered the LLM through its system instruction to "exclusively answer the user's questions with information taken from the context". This effectively eliminates AI hallucinations, while simultaneously providing the LLM with additional information it didn't previously have knowledge about. You can verify this by seeing how our chatbots refuse to answer questions they don't have the answer to in their RAG database.
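Below is a sketch of what such a grounded chat invocation might look like, again assuming Python. The exact wording of the system instruction, the helper name, and the "gpt-4o-mini" model are illustrative assumptions on my part.

```python
# Minimal sketch: grounding the LLM in RAG context through its system instruction.
def answer_with_context(client, question: str, context_snippets: list[str]) -> str:
    system_instruction = (
        "Exclusively answer the user's questions with information taken "
        "from the context below. If the answer is not in the context, "
        "politely say you don't know.\n\nContext:\n" + "\n\n".join(context_snippets)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": system_instruction},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```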
To see this in practice, try asking our AI chatbot "how do I bake a cake?" The point is that it will refuse to answer even a basic question such as this, simply because we don't have cake recipes in our RAG database.
This simple trick effectively eliminates AI hallucinations, meaning the chatbot will never try to "make facts up". In the screenshot below you can see this "distance" for an individual query doing lookups through our machine learning type for the prompt "Tell me about your AI chatbot technology". Notice the [0.14] parts in the left column. This is the "distance" between the prompt and the individual RAG training snippet.
Once we've found the most relevant records, we attach them one at a time, starting with the closest match, until we've filled up the "context window" with somewhere between 4,000 and 100,000 OpenAI API tokens, depending upon how the type has been configured. Below you can see how to configure these parts with Magic.
In the above screenshot you can see the "max context token window size" in the bottom left corner.
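The "fill until the budget is full" logic might look roughly like the following sketch. It assumes the `scored_records` list from the earlier similarity search sketch, and uses the tiktoken library to count tokens; the function and field names are hypothetical.

```python
# Minimal sketch: attaching records, closest first, until the token budget is full.
import tiktoken

def build_context(scored_records, max_context_tokens: int = 4000) -> list[str]:
    encoder = tiktoken.get_encoding("cl100k_base")
    snippets, used = [], 0
    for distance, record in scored_records:  # already ordered by distance
        tokens = len(encoder.encode(record["content"]))
        if used + tokens > max_context_tokens:
            break  # the context window is full
        snippets.append(record["content"])
        used += tokens
    return snippets
```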
SSE
"SSE" implies "Server Side Events" and allows an HTTP endpoint to "stream" back content. This is just a plain old HTTP invocation, with one crucial difference being that instead of getting the whole response at once, you get smaller parts of the response as it is available from the LLM's side.
By combining OpenAI's SSE abilities with SignalR, we can use web sockets to "stream" tokens and words back to the client connected to our backend. This is why you can see our AI chatbot answering one word at a time, resembling a human being manually typing. It reduces the time before the user can start reading the response, creating a superior user experience. You can try this with our AI chatbot if you want to see it in action; it's basically the same technique ChatGPT uses to write out individual words, one word at a time.
SignalR is for .Net and C#, but there are similar libraries out there for Python, GoLang, Java, and every other major programming language in existence, as sketched below.
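As a rough Python stand-in for what SignalR does in our stack, here is a sketch using FastAPI's plain WebSocket support to relay the stream to the browser. This is not our implementation, just an illustration of the "stream words to the client" idea; it reuses the `client` and `stream_answer` names from the earlier sketches.

```python
# Minimal sketch: relaying streamed tokens to the browser over a web socket.
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()
    question = await websocket.receive_text()
    # stream_answer() and client come from the previous sketches.
    for token in stream_answer(client, [{"role": "user", "content": question}]):
        await websocket.send_text(token)  # the browser renders one word at a time
    await websocket.close()
```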
Wrapping up
The above is the basic architecture for a RAG-based AI chatbot, and assuming you know how to code, you should be able to implement something resembling our AI chatbot once you understand the above process. To explain what is actually happening to people who cannot create software, we typically describe the process as follows:
We put ChatGPT on your website, and make it say what you want it to say!
Such AI chatbots are incredibly valuable, serving as AI sales executives, AI-based customer service and support, or AI expert systems for your back office workers. If you don't want to create one manually, you can download our open source platform and use it to create one for yourself. You can find a link to our open source repository below, and simply use the "Chatbot Wizard" that takes care of wiring all of these constructs together, providing you with a simple out of the box solution. You can see a screenshot of the wizard below.
Below you can clone Magic Cloud.
If you don't know anything about software development, or even what "cloning" implies, you can contact us below to have us create everything for you.