Auto Crawling ChatGPT

Today we shipped a major new feature for our ChatGPT website chatbot products: automatic crawling. Once a day, we crawl your website to check for new pages and content. If the crawler finds new content, it automatically adds it to your existing training snippets, which means your chatbot becomes "smarter" over time.

The feature has to be enabled per machine learning model. Edit your model and set its "Website" property to the primary domain you want to automatically crawl. If you're using the chatbot wizard on a new model, this is wired up for you automatically.

Your model also needs room for more training snippets for this to work. How many training snippets an individual model can hold depends upon your license. The limits are listed below for reference.

  • Basic, 500 snippets
  • Pro, 1,250 snippets
  • Enterprise, 2,500 snippets

For the record, we don't count pages, we count training snippets. Typically one web page becomes somewhere between 1 and 10 training snippets, depending upon your page's structure, with an average of about 5. The auto crawl feature works on your model, which means it works just as well for AI Website Search as it does for ChatGPT website chatbots.
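To get a feel for what the limits above mean in pages, here is a back-of-the-envelope sketch in Python. It only uses the figures from this article (the three license limits and the 1 to 10 snippets per page range, with 5 as the average). The function name is made up for illustration, and the result is an estimate, not a guarantee for your particular site.

```python
# Snippet limits per license, taken from the list above.
LICENSE_LIMITS = {"Basic": 500, "Pro": 1250, "Enterprise": 2500}


def estimated_pages(limit, snippets_per_page=5):
    """Rough number of pages that fit within a snippet limit."""
    return int(limit // snippets_per_page)


for name, limit in LICENSE_LIMITS.items():
    typical = estimated_pages(limit)        # 5 snippets per page
    worst = estimated_pages(limit, 10)      # 10 snippets per page
    best = estimated_pages(limit, 1)        # 1 snippet per page
    print(f"{name}: roughly {typical} pages (between {worst} and {best})")
```

With the average of 5 snippets per page, a Basic license works out to roughly 100 pages, Pro to roughly 250, and Enterprise to roughly 500.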

How we're counting

The reason we're not counting pages is that pages are quite frankly irrelevant. To create a high quality chatbot, you don't care about pages, you care about "constructs", and using an entire page as a single training snippet is counterproductive. When you create a context for OpenAI and ChatGPT to use, you want to provide as many different concepts as possible. This allows OpenAI to "create associations" when you for instance ask "Who's the best in xyz of a, b and c?". If the first training snippet for person "a" is so big that it uses up the entire context window, there is no room for "b" and "c" in your context, which significantly reduces the quality of the response.
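The "a, b and c" problem can be illustrated with a small Python sketch. This is not our actual context builder. It is a simplified greedy loop, and the snippet sizes and the budget of 4000 are hypothetical numbers chosen only to show the effect.

```python
def build_context(snippets, budget):
    """Add snippets in ranked order for as long as they fit in the budget.

    `snippets` is a list of (label, size) tuples and `budget` is the
    maximum total size. Sizes are in one hypothetical unit, e.g. tokens.
    """
    chosen, used = [], 0
    for label, size in snippets:
        if used + size <= budget:
            chosen.append(label)
            used += size
    return chosen


# One whole-page snippet per person: "a" alone nearly fills the window.
whole_pages = [("a", 3500), ("b", 3200), ("c", 3000)]

# The same pages chopped into smaller, focused snippets.
small_snippets = [("a", 900), ("b", 800), ("c", 850)]

print(build_context(whole_pages, 4000))     # ['a']
print(build_context(small_snippets, 4000))  # ['a', 'b', 'c']
```

With whole pages, only "a" makes it into the context, so the model has nothing to compare against. With smaller snippets, all three fit.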

When we scrape your website, we apply a lot of "intelligent compression logic" to ensure we only end up with high quality context data. One of the things we do is chop up individual web pages into multiple training snippets, according to each page's header structure and paragraph structure. This significantly increases the quality of the end product for you, because we're able to provide a better context for OpenAI.
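As a rough illustration of chopping a page up by its header structure, here is a minimal Python sketch using only the standard library. It is an assumption about how such splitting could look, not our production scraper, which does considerably more. The class and function names are invented for this example.

```python
from html.parser import HTMLParser

HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SnippetSplitter(HTMLParser):
    """Collects one snippet per header, holding the paragraphs below it."""

    def __init__(self):
        super().__init__()
        self.snippets = []   # list of {"title": str, "text": str}
        self._tag = None     # header or paragraph tag currently open

    def handle_starttag(self, tag, attrs):
        if tag in HEADERS:
            # Every header starts a new snippet.
            self.snippets.append({"title": "", "text": ""})
            self._tag = tag
        elif tag == "p":
            # Paragraphs before the first header get an untitled snippet.
            if not self.snippets:
                self.snippets.append({"title": "", "text": ""})
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == self._tag:
            if tag == "p":
                self.snippets[-1]["text"] += "\n"
            self._tag = None

    def handle_data(self, data):
        # Text outside headers and paragraphs (navbars etc.) is ignored.
        if self._tag in HEADERS:
            self.snippets[-1]["title"] += data
        elif self._tag == "p":
            self.snippets[-1]["text"] += data


def split_page(html):
    """Return the page as a list of snippets, dropping empty ones."""
    parser = SnippetSplitter()
    parser.feed(html)
    parser.close()
    return [
        {"title": s["title"].strip(), "text": s["text"].strip()}
        for s in parser.snippets
        if s["text"].strip()
    ]


page = (
    "<nav>Home | About</nav>"
    "<h1>Pricing</h1><p>Basic plan.</p><p>Pro plan.</p>"
    "<h2>Support</h2><p>Email <b>us</b>.</p>"
)
for snippet in split_page(page):
    print(snippet)
```

The example page becomes two snippets, "Pricing" and "Support", and the navbar text is left out because it sits outside any header or paragraph.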

This logic however increases the number of training snippets in your model, and since our cost is per snippet and not per page, we need to count snippets and not pages. However, we never insert the same training snippet twice. Since we chop every page into multiple snippets, this almost completely eliminates redundant and repeated information, such as navbars and common footers.
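The "never insert the same snippet twice" rule, combined with the snippet limit on your license, can be sketched as follows in Python. How we actually detect duplicates is not described here. Comparing a hash of the normalised text is an assumption made for this example, and the function names are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Hash of the text, ignoring differences in case and whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def add_new_snippets(existing, candidates, limit):
    """Append unseen candidates to `existing` in place.

    Stops adding once the model holds `limit` snippets.
    Returns the number of snippets that were added.
    """
    seen = {fingerprint(s) for s in existing}
    added = 0
    for text in candidates:
        key = fingerprint(text)
        if key in seen or len(existing) >= limit:
            continue
        existing.append(text)
        seen.add(key)
        added += 1
    return added


# Snippets already in the model, and the result of a new daily crawl.
model = ["Home | About | Contact", "We sell widgets."]
crawl = [
    "Home | About |  Contact",        # same navbar, found on every page
    "New blog post about widgets.",
    "New blog post about widgets.",   # repeated on an archive page
]

print(add_new_snippets(model, crawl, limit=500))  # 1
print(len(model))                                 # 3
```

Only the new blog post is added. The navbar and the repeated copy are skipped, so they don't eat into your snippet limit.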

Half of our clients send us an initial email resembling the following: "We've tried everything out there, and after having played with your chatbots for a week, we realise it's simply superior". Now you know why ...

4x database speed

In addition to the auto-crawl feature above, we've also increased the database speed by 4 to 6 times. We can of course not speed up OpenAI itself, but the CPU time spent gathering your context is now roughly 20% of what it used to be. This will be especially noticeable on larger models, whose response times get faster. We might choose to "give back" some of these gains in the future by expanding the size of our packages, but we'll have to do more testing before we can confidently do that.


We'll try to come out with at least one major release such as this each week going forward. Last week we reduced the size of the embed script by 95%, the week before that we released AI Website Search, and this week it was backend improvements. If you previously got stuck with our stuff because of lack of performance, let us know where you're stuck, and we'll prioritise it.

Thomas Hansen

I am the CEO and Founder of AINIRO.IO, Ltd. I am a software developer with more than 25 years of experience. I write about Machine Learning, AI, and how to help organizations adopt said technologies. You can follow me on LinkedIn if you want to read more of what I write.

Published 30. Apr 2023