Create an AI Chatbot by Scraping your Website

There are roughly 1 billion websites in the world today. OpenAI's ChatGPT was created by "scraping" and "crawling" these. Scraping a website means downloading its text and somehow storing it in a database, while "crawling" means finding all the pages on a single site. This is a bit oversimplified, but as a general rule it explains how website scraping works.
The big thing about our technology is that it allows you to scrape and crawl websites yourself, creating a database of RAG data that can be used to provide "context" for OpenAI's APIs. This allows you to deliver a custom AI chatbot, based upon your data, that uses your website content to answer customer service questions. You can try such a chatbot at our website here.
RAG versus Fine Tuning
RAG stands for "Retrieval-Augmented Generation", and allows you to "seed" OpenAI's AI models with data, having the AI exclusively use this data to answer the user's questions. If you ask ChatGPT who its CEO is, for instance, it will probably answer Sam Altman. If you ask our chatbot who your CEO is, it will tell you it's me. This is one example of how we can put ChatGPT on your website and have it answer questions the way you want it to.
This allows you to use your website as your source, and create an AI chatbot that answers questions from its users, where the information used to answer them originates from your website. For a company with a lot of manual customer service requests, this simple trick can sometimes eliminate 50% of your manual support tickets, and sometimes even more. We've got clients saving $20,000 per month with our RAG technology.
With fine-tuning, however, you'll need to train your own LLM, typically based upon some base model from OpenAI or another LLM provider, to create a new LLM. This process is much more expensive than creating a RAG database, and it doesn't eliminate what the industry refers to as "AI hallucinations". An AI hallucination happens when the LLM is asked a question it doesn't know the answer to, at which point it starts making up facts. Hence, having a strong RAG foundation is simply technologically superior to creating your own LLM, and it's much less expensive and requires much less work.
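To make the idea concrete, here's a minimal sketch of how RAG works under the hood: retrieve the most relevant snippets from your database of scraped content, then hand them to the LLM as context. The snippets, the naive word-overlap scoring, and the prompt wording below are all made up for illustration; real systems typically use vector embeddings for retrieval.

```python
import re

def score(question, snippet):
    """Naive relevance score: count of shared words (illustration only)."""
    words = lambda text: set(re.findall(r"\w+", text.lower()))
    return len(words(question) & words(snippet))

def build_prompt(question, snippets, top_n=2):
    """Pick the top-N most relevant snippets and build a grounded prompt."""
    ranked = sorted(snippets, key=lambda s: score(question, s), reverse=True)
    context = "\n".join(ranked[:top_n])
    return (
        "Answer ONLY using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical snippets scraped from a website.
snippets = [
    "Our CEO is Jane Doe.",
    "We offer a free 7-day demo chatbot.",
    "Our offices are closed on Sundays.",
]
prompt = build_prompt("Who is your CEO?", snippets)
# The prompt now contains the CEO snippet, so the LLM answers from
# your data instead of its training data.
```

Because the prompt instructs the model to answer only from the supplied context, the chatbot answers with *your* CEO's name rather than Sam Altman.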
Markdown and Images
When our technology scrapes your website it will create Markdown from the HTML it finds. This has huge benefits for the quality of the RAG data we're capable of creating. The most extreme example of this is maybe how we're able to have our AI chatbots display images, something you can see from the above screenshot. This is a feature almost none of our competitors are able to deliver, even though it's probably one of the most important features an AI chatbot can have. We've written extensively about the subject in the article below.
This quality difference originates from being able to transform your website's HTML into Markdown, which keeps the "semantics" when creating your RAG database. When the LLM can see what's your header, what's your title, where a paragraph starts and stops, and which words are bold, and so on - It becomes much "smarter". So the ability to scrape a website and generate Markdown from its content simply produces superior quality RAG data. If you're making a short list of features to look for, you can use the following as a checklist when purchasing an AI chatbot.
- Can I scrape and crawl my website for RAG data?
- Does it produce Markdown RAG data?
- Does it preserve images found during scraping?
If your AI chatbot vendor doesn't support all of the above, you should probably consider another vendor. Below is a screenshot from our technology, demonstrating how it produces Markdown as it builds its RAG database.
The above might not look like a big thing for a human being, but for the LLM it's all the difference in the world.
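To illustrate what this transformation involves, below is a minimal sketch of converting HTML to Markdown with Python's standard library, preserving headers, bold text, and images. A real converter handles far more tags, and the HTML input is made up for the example.

```python
from html.parser import HTMLParser

class Html2Markdown(HTMLParser):
    """Tiny sketch of HTML-to-Markdown conversion that preserves
    semantics: headers, bold text, and images."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self.out.append("# ")
        elif tag == "h2":
            self.out.append("## ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag == "img":
            # Preserve images as Markdown image syntax.
            self.out.append(f"![{attrs.get('alt', '')}]({attrs.get('src', '')})")

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("h1", "h2", "p"):
            self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(data)

html = ('<h1>Pricing</h1><p>Plans start at <strong>$49</strong>.</p>'
        '<img src="/plans.png" alt="Plans">')
parser = Html2Markdown()
parser.feed(html)
markdown = "".join(parser.out)
# markdown now reads:
# # Pricing
#
# Plans start at **$49**.
#
# ![Plans](/plans.png)
```

Notice how the header, the bold price, and the image all survive the conversion, which is exactly the semantic information the LLM benefits from.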
How scraping works
There are a lot of really bad website scrapers out there - But most of them follow roughly the following process.
- Read the "robots.txt" file and find the sitemap
- Read the sitemap file(s) and scrape these in sequential order
- Build a RAG database from the scraped content
- Continue until it's scraped all pages on your website
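The first two steps above can be sketched in a few lines of Python. To keep the example self-contained, the robots.txt and sitemap contents are hard-coded strings rather than real HTTP responses; a real crawler would download them first.

```python
import xml.etree.ElementTree as ET

# Hypothetical robots.txt content, pointing to the site's sitemap.
robots_txt = """User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

# Hypothetical sitemap content, listing every page to scrape.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""

def find_sitemaps(robots):
    """Step 1: read robots.txt and find the sitemap URL(s)."""
    return [line.split(":", 1)[1].strip()
            for line in robots.splitlines()
            if line.lower().startswith("sitemap:")]

def list_pages(xml_text):
    """Step 2: read the sitemap and list every page URL to scrape."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

sitemaps = find_sitemaps(robots_txt)   # ["https://example.com/sitemap.xml"]
pages = list_pages(sitemap_xml)        # every page URL, in order
```

The scraper then downloads each of these pages in turn, converting the content into RAG data as it goes.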
Some websites turn off scraping and install "scraping shields" to prevent scrapers from working. Not only is this an incredibly risky thing to do, since it might also prevent Google and other search engines from indexing your website - But it also makes our job that much more difficult. If you've got one of these, please turn it off. Nobody cares about your website enough to spend a lot of time stealing it - Sorry, I just said it. Below you can see a screenshot of me scraping a website for RAG data.
Is scraping legal?
As a general rule of thumb, scraping is 100% legal, and you can scrape any website you wish, as long as you obey the rule-set in the website's "robots.txt" file. Our crawler explicitly identifies itself as an AINIRO crawler. This allows website owners to shut it out and prevent it from doing its job. If the site doesn't explicitly stop our scraper, using it on other people's websites is 100% legal.
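Python's standard library even ships with a robots.txt parser, which a well-behaved crawler can use to check whether it's allowed to fetch a page before scraping it. The rules and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: block the AINIRO crawler from /private/,
# allow everything else.
rules = """User-agent: AINIRO
Disallow: /private/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler checks every URL before fetching it.
parser.can_fetch("AINIRO", "https://example.com/pricing")         # True
parser.can_fetch("AINIRO", "https://example.com/private/secret")  # False
```

Any page the check denies is simply skipped, which is how a site owner keeps content out of a well-behaved crawler's RAG database.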
However, just because something is legal doesn't imply it's morally right. We have no customers who scrape other people's websites en masse, and we reject such customers every time we're asked if we can do it. Stealing other people's content to build an AI chatbot from their intellectual property is not "OK", and we're not going to help you do it - Regardless of whether or not it's legal. However, "spicing" your model with some public and general information, taken from a single page on some website you don't own, is both perfectly legal and moral, and we have no issues helping you out with this if required.
Without website scraping, no search engine would work. Google and Bing, for instance, are 100% dependent upon having access to scrape and crawl your website, and they scrape millions of websites every single week. If website scraping weren't legal, both Google and Bing would have to file for Chapter 11, since without scraping it becomes impossible to build a search engine.
Wrapping up
We've looked at how website scraping works, why it's important, and what features you should ask your scraping technology provider for. We've looked at why Markdown matters in your RAG database, and we've looked at the importance of preserving images during the process. We've also discussed the legality of scraping a website, in addition to ethics and morals related to the subject.
At AINIRO we've got a "bajillion" ways to create your RAG database. We support uploading documents, Excel files, PDF files, connecting to APIs, and literally every possible way to extract data from somewhere and generate a RAG database. However, the simplest way to generate RAG data is to point our crawler to your website and click a simple button. If you want to try this out for free on your own website, you can fill out the form here, and you'll have a free 7-day demo AI chatbot built from your content in about 5 minutes. If you're interested in talking to us about alternative means to create a RAG database for you, resulting in an AI chatbot based upon your content, you can contact us below.