Markdown to HTML AI Website Scraper

Markdown to HTML AI Website Scraper

AI is only as good as the input you give it. Everybody who's done some prompt engineering with ChatGPT knows this. The difference between what an experienced prompt engineer can achieve with ChatGPT and somebody who's never used it before is huge. The same is true for AI chatbots.

Trash in, trash out

An AI chatbot such as ours is typically created by scraping some website. The problem is that 98% of your website's content is not relevant to the chatbot. If you click "View source" on this page for instance, you'll see something resembling the following.

View source on a webpage

In the above screenshot of this page less than 5% of the actual content is relevant to the AI. This is a problem when you're dealing with RAG. First of all because if you provide irrelevant data to OpenAI as context, it becomes "confused". Secondly, you're spending valuable tokens to transmit trash to the LLM.

If you send trash to the LLM you're basically "lobotomizing it". Trash in, trash out!

Extracting the text

One solution to this is to simply return all the text content from the page. This is fairly easy if you've got an HTML parser. This solves 50% of the problem, and is what most of our competitors are doing.

This creates new issues for you. First of all the text content still becomes too large because it will extract things such as your navbar, your copyright notice, and all the "boiler plate template code" you've got on all of your pages.

If you scroll to the top or bottom of this page for instance, you will see our navbar hierarchy at the top. At the bottom you will see an "About the author section" in addition to another navbar hierarchy. Repeating this content in every single record as you create your RAG database implies that every single record will contain 50% similar content. Hence, you're still wasting 50% of your context.

You turn the LLM into a "mentally challenged inferior intelligence" with 50% of its capacity for "intelligence"

The above is like giving Einstein access to only CNN and Sky News for the whole of his life, resulting in no theory of relativity for obvious reasons ...

Creating Markdown from HTML

When we scrape a website, we will take the original HTML, semantically parse it, and create Markdown from it. Below you can see one such training snippet from one of our demo chatbots.

Markdown training snippet

The above is first of all only a tiny sub-section of the original page. However, once we've got the Markdown, we will "chop this up" into multiple snippets. First of all we will treat all lists on the page as individual training snippets. In addition we will keep hyperlinks and images and create Markdown out of these.

Then we insert multiple records for each page as we scrape your website. Typically, this creates some 5 to 10 training snippets for every single page in your site. However, and here comes the crucial part of what we do: We only insert training snippets we have not previously inserted.

The RAG database therefor contains very few repeating records of information!

The last part above is crucial, since each list becomes a separate training snippet by itself, something you can see in the above screenshot. This ensures that your navbar hierarchy and "boiler plate parts" is only inserted once into your RAG database.

When we later do RAG retrieval from your database, there are few repetitions in your RAG records. In addition, we get to stuff more information into the context we're sending to ChatGPT, because we don't have repeating information to the extent others do. This is actually a software development design principle referred to as DRY, or "Don't Repeat Yourself", and the same principle applies to everything related to AI.

The AI chatbot that displays images

The above benefits should be obvious, but there is one crucial feature this gives us we haven't found with any of our competitors so far; Our AI chatbot can display images. Read more about this in the following link.

When you create Markdown from HTML you first of all clean up the 98% "garbage information". Secondly, you get to keep images, lists, and hyperlinks as they are, allowing us to instruct OpenAI to "return relevant images and hyperlinks as Markdown". Below is a screenshot of how this allows our AI chatbot to display images.

Screenshot of AI chatbot displaying an image of Thomas Hansen

If you look at the above image of me, you will notice it's the same image further down on this page. This is because the above image was taken from our website as we crawled and scraped it. The process is 100% automatic, and didn't require any manual work from our side. Our Markdown website scraper did everything automagically.

Conclusion

AI is not a silver bullet. The AI will not magically compress all garbage into diamonds, it's just not happening. To fully take advantage of AI, you need to feed it with high quality data to make it perform better. If you transform the HTML to Markdown though, it becomes the equivalent of finding that single diamond in a mountain of garbage. Once you increase quality, new axioms emerges, and new capabilities are discovered, such as our chatbot being able to display images and hyperlinks.

Being able to transform HTML into Markdown, chopping the resulting Markdown into multiple database records, significantly increases quality, and avoids repeating information. This creates more records in your RAG database, but each record ends up having less repeating information. This process significantly increases the quality of your AI chatbot, making it become "smarter".

This allows ChatGPT to work with more information as it's answering your questions

Yesterday we came out with a massive release, that further improves upon our chatbot's ability to correctly transform HTML to Markdown. Before we released it, we tested it on 649 websites to make sure it performed well - And as it scraped 649 pages, it didn't produce a single broken link or image, unless the HTML itself was referencing broken links and images itself.

We are constantly improving upon our chatbot's abilities, and increasing the quality of our web scraper is one of our most important tasks in these regards. To put that into context, realize Google is worth a trillion dollar or something. 50% of their value as a company is literally their web scraper. Remove Google's web scraper, and it's worthless as a company in 6 months. We know this of course, and Google knows this. Now you know it too.

An AI chatbot without a kick ass scraper is basically like Google without a scraper, literally worthless. We've got a kick ass scraper, and according to our own internalt tests, it outperforms everything we've seen out there so far. You can contact us below if you want help with embracing AI.

Psst, the above list and hyperlink is only inserted once in our RAG database, something you can verify if you've got your own cloudlet, and you scrape our website ... 😉

Thomas Hansen

Thomas Hansen I am the CTO of AINIRO.IO AS. I am a software developer with more than 25 years of experience. I write about Machine Learning, AI, and how to help organizations adopt said technologies. You can follow me on LinkedIn if you want to read more of what I write.

Published 31. Jan 2024