How to Scrape a URL for your Custom GPTs

Our website scraper has always been the envy of the industry, and it's also the cornerstone of our AI Chatbot.

When we released AI Workflows, we decided to make the scraper publicly available for everyone to use, allowing you to create an API endpoint that accepts a URL, scrapes the specified URL, and returns its content as high quality Markdown.

By transforming the original URL's content into Markdown, we preserve hyperlinks, images, lists, bold and italic text, etc. This allows the LLM to work with high quality data containing links and images, where the relative importance of individual parts is preserved.
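To illustrate the general idea, below is a minimal sketch of such a scraper in Python, using the open-source requests and html2text libraries. This is not AINIRO's scraper, only an approximation of the technique it applies.

# Minimal sketch of URL-to-Markdown scraping. This is NOT AINIRO's
# scraper, only an illustration of the general technique, using the
# open-source 'requests' and 'html2text' libraries.
import requests
import html2text

def scrape_as_markdown(url: str) -> str:
    # Fetch the raw HTML from the URL.
    html = requests.get(url, timeout=30).text

    # Configure the converter to keep hyperlinks and images,
    # which is the whole point of using Markdown as the output format.
    converter = html2text.HTML2Text()
    converter.ignore_links = False
    converter.ignore_images = False
    converter.body_width = 0  # Don't hard-wrap lines.

    return converter.handle(html)

print(scrape_as_markdown("https://example.com"))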

The result is 10x higher quality scraping, because the Markdown preserves the structure of the original content.

Scrape URLs for your Custom GPTs

I have probably seen a "bajillion" custom GPTs whose purpose is to create Facebook ads, LinkedIn articles, Twitter updates, etc. They all have one thing in common: they ask me for my content. At this point I'm expected to copy and paste my content into ChatGPT.

Some of these custom GPTs are incredibly intelligently put together, and I'm super impressed by some of them. However, without a high quality website scraper, they miss the mark(down) (pun intended).

For instance, there's no way I can tell the GPT to preserve the original hyperlinks, preserve images, or preserve the relative importance of individual list items or headers. To the GPT, what I provide is "simple text". If you provide it with Markdown instead, it will be able to preserve hyperlinks, images, and the relative importance of specific items, since Markdown preserves Hx elements, list items, hyperlinks, and images. At this point the quality of my custom GPT has already increased 10x. Watch the following YouTube video for a demonstration of how to create such a scraper.
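To make the point concrete, a fragment of such scraped Markdown output might look like the hypothetical example below, where header levels, hyperlinks, list items, and images all survive the scraping.

# An Article Title

Some introduction with a [hyperlink](https://example.com/about).

## A Section Header

* First list item
* Second list item

![A diagram](https://example.com/diagram.png)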

Use cases

The possibilities here are endless. Below are some use cases I've thought about myself.

  • Translation GPTs that translate a URL into some language while keeping all information as-is
  • Summarizer GPTs that summarize the most important parts of articles, allowing me to read them much faster
  • Marketing GPTs that create social media updates or ads based upon my existing content
  • Learning GPTs that take original content and explain it to children using easily understood language

All of the above use cases originate from a single feature: the ability to scrape a URL and preserve the URL's content as Markdown.
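As a concrete illustration, the system instructions for a hypothetical "Summarizer GPT" built on top of such an endpoint might read something like the following. The wording and the action name get_context are my own assumptions, not a template from AINIRO.

You are a summarizer. When the user gives you a URL, invoke the
get_context action with that URL as its url argument, then summarize
the returned Markdown in 5 bullet points, preserving all hyperlinks
and images exactly as they appear in the Markdown.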

The code

If you don't have time to watch our YouTube video, you can create a new folder inside your cloudlet, create a new file called "get-context.get.hl" inside this folder, and copy and paste the code below into that file. This gives you an API endpoint inside your cloudlet that allows you to scrape a URL from your custom GPT. Just remember to install the OpenAI plugin first, since the code relies upon a workflow action from this plugin.

// Scrapes the specified URL
.arguments
   url:string

// Returns [max_tokens] context from the specified [url].
execute:magic.workflows.actions.execute
   name:openai-context-from-url
   filename:/modules/openai/workflows/actions/openai-context-from-url.hl
   arguments
      url:x:@.arguments/*/url
      max_tokens:int:8000

// Returns the result of your last action.
return-nodes:x:@execute/*

The end result should resemble the following.

Creating the scrape URL API endpoint in Hyper IDE

Then save the file, and click the HTTP icon at the top of your tree view. This gives you the OpenAPI specification you need to paste into your custom GPT as an "action".
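For reference, the specification will roughly resemble the sketch below, assuming the file was saved inside a hypothetical folder named "my-gpt". The server URL, path, and operationId are placeholders; always use the exact specification Hyper IDE generates for you.

# Rough shape of the generated OpenAPI specification. The server URL,
# path, and operationId below are placeholders.
openapi: 3.0.1
servers:
  - url: https://YOUR-CLOUDLET.ainiro.io
paths:
  /magic/modules/my-gpt/get-context:
    get:
      operationId: get_context
      summary: Scrapes the specified URL and returns its content as Markdown
      parameters:
        - name: url
          in: query
          required: true
          schema:
            type: string
      responses:
        "200":
          description: The scraped content as Markdown

Before wiring the endpoint into your GPT, you can also verify that it works with a small sketch like the one below, again assuming the hypothetical folder name and cloudlet URL from above.

# Hypothetical smoke test of the endpoint. The folder name and
# cloudlet URL are assumptions - substitute your own values.
import requests

CLOUDLET = "https://YOUR-CLOUDLET.ainiro.io"
ENDPOINT = f"{CLOUDLET}/magic/modules/my-gpt/get-context"

response = requests.get(ENDPOINT, params={"url": "https://example.com"})
response.raise_for_status()
print(response.text)  # The response payload containing the scraped Markdown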

Conclusion

You can of course tell the user to copy and paste the content, and that will give you 50% of the value. However, plain text will not preserve images, it will not preserve hyperlinks, and it will not preserve the original structure and relative importance of the individual parts of your content.

With a URL scraper creating Markdown from URLs, all of the above problems are solved, and you're able to deliver an amazing solution, separating your particular GPT from everybody else's GPTs.

Your custom GPTs have now increased 10x in quality, opening up a whole new range of use cases that are almost impossible to solve without a website scraper, in addition to making your GPT 10x easier and more user friendly to use.

Notice that for all of our existing clients, the above is an integrated part of our services, and something they can assemble in a few minutes. In the above YouTube video, I walk you through the required steps, allowing you to create a custom GPT with URL scraping capabilities for yourself in less than 5 minutes. Contact us below if you want to get started.

Thomas Hansen

I am the CEO and Founder of AINIRO.IO, Ltd. I am a software developer with more than 25 years of experience. I write about Machine Learning, AI, and how to help organizations adopt these technologies. You can follow me on LinkedIn if you want to read more of what I write.

Published 11 Feb 2024