HOWTO keep your RAG Database DRY during Website Scraping


DRY stands for "Don't Repeat Yourself", and it is crucial in anything related to software development, including how you create your RAG database. If you look at your website, you will find tons of repeated information: its copyright notice, navbar, header, footer, and so on.

Unless you remove this before creating your RAG database, you repeat the same information in every single RAG training snippet you've got. That is not only a waste, it also fills up the context with duplicates, which prevents the LLM from getting access to all the information it needs to answer your users' questions.

We refer to this data as "cognitive noise".

How to keep your RAG database "DRY"

In the following video, I demonstrate how to "semantically" crawl and scrape a website. The point is that instead of just scraping the whole content and dumping it into my database, I use the Hyperlambda generator to generate a custom Hyperlambda script that only imports the unique parts of every single web page on my site.

In the above video the most extreme saving is almost 5x. That's going from simply "dumping the HTML into the database" to the much more refined version, "extract the ARTICLE element and transform it to Markdown".

  • HTML is 9,500 tokens
  • Pure Markdown of only ARTICLE element is 2,000 tokens
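The two figures above can be compared with a quick calculation. This is only a sketch of the arithmetic; the function name `token_savings` is made up for illustration and is not part of Magic or Hyperlambda.

```python
def token_savings(before: int, after: int) -> tuple[float, float]:
    """Return (reduction factor, percentage of tokens removed)."""
    factor = before / after
    percent_removed = (before - after) / before * 100
    return factor, percent_removed


# Token counts taken from the two bullet points above.
factor, percent_removed = token_savings(9_500, 2_000)
print(f"{factor:.2f}x smaller, {percent_removed:.0f}% fewer tokens")
```

In other words, "almost 5x" and "roughly 79% fewer tokens" describe the same saving.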

The saving is almost 5x (9,500 / 2,000 ≈ 4.75), or roughly 79% fewer tokens! However, even for the least refined version, which only returns Markdown of the whole page, you'd still be looking at 50% improved RAG quality. And everything was done from Magic's dashboard, using natural language as follows ...

Crawl ainiro.io's sitemap for all URLs containing the string '/blog/', and extract the TITLE element, in addition to the Markdown of the first ARTICLE element you find. Then for each of these, insert into database 'magic' and table 'ml_training_snippets' using 'type' of 'next', and 'prompt' as TITLE and 'completion' as the Markdown.
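To make the steps in that prompt more concrete, here is a rough Python sketch of the same idea. It is not the Hyperlambda script that Magic generates. It assumes a standard sitemap.xml, uses an in-memory SQLite table with only the three columns named in the prompt (the real 'ml_training_snippets' table may look different), and keeps the ARTICLE content as plain text instead of converting it to Markdown. Downloading the pages is left out, so the `example.com` sitemap and page below are made-up inline samples.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def blog_urls(sitemap_xml: str, needle: str = "/blog/") -> list[str]:
    """Return every <loc> URL in the sitemap that contains `needle`."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")
            if loc.text and needle in loc.text]


class ArticleExtractor(HTMLParser):
    """Collects the TITLE text and the text of the first ARTICLE element."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.chunks: list[str] = []
        self._in_title = False
        self._depth = 0      # > 0 while inside the first <article>
        self._done = False   # True once the first <article> has closed

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "article" and not self._done:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "article" and self._depth > 0:
            self._depth -= 1
            self._done = self._depth == 0

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html: str) -> tuple[str, str]:
    """Return (title, article text); navbar, footer, etc. are skipped."""
    parser = ArticleExtractor()
    parser.feed(html)
    return parser.title.strip(), "\n\n".join(parser.chunks)


# Made-up sample data standing in for a real sitemap and a downloaded page.
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first-post</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""
page = """<html><head><title>First post</title></head><body>
<nav>Home | Blog | Contact</nav>
<article><h1>First post</h1><p>Unique content.</p></article>
<footer>Copyright notice</footer></body></html>"""
pages = {"https://example.com/blog/first-post": page}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ml_training_snippets "
           "(type TEXT, prompt TEXT, completion TEXT)")
for url in blog_urls(sitemap):
    title, body = extract(pages[url])
    db.execute("INSERT INTO ml_training_snippets (type, prompt, completion) "
               "VALUES (?, ?, ?)", ("next", title, body))
rows = db.execute(
    "SELECT type, prompt, completion FROM ml_training_snippets").fetchall()
print(rows)
```

The navbar and footer never reach the database, because only text inside the first ARTICLE element is collected. A real pipeline would also fetch each URL and run the extracted HTML through an HTML-to-Markdown converter before inserting it.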

So basically, a little bit more work, but the result becomes 300% "better" ...

Thomas Hansen


I am the CEO and Founder of AINIRO.IO, Ltd. I am a software developer with more than 25 years of experience. I write about Machine Learning, AI, and how to help organizations adopt said technologies. You can follow me on LinkedIn if you want to read more of what I write.

This article was published 7. Jan 2026


Copyright © 2023 - 2025 AINIRO.IO Ltd