Yesterday we were able to connect our ChatGPT chatbots to the internet. The feature is still experimental, and it doesn't always work - but today we were able to sort out 90% of the bugs related to it.
In this article I will highlight some of the more important things we did to accomplish this, and share my findings with the rest of the world. Hopefully it'll be useful for others trying to accomplish the same.
We're a Hyperlambda shop of course, but you can probably transpile our ideas to your programming language of choice.
Try ChatGPT with Internet Access
Before I start explaining what we did, do me a favour. Click our chat button in the bottom right corner of this page, and write the following into it:
Find me information for the following query "Does WHO declare COVID-19 to be a pandemic in 2023?"
The point is that you'll get something like the following back from it.
The difference between the above answer and what ChatGPT gives you is obvious, I presume. Our chatbot can reach out to the internet using DuckDuckGo and scrape the resulting websites, allowing ChatGPT to deal with real-time information from the internet. Notice, it will only reach out to the internet if you phrase your query as follows:
Find me information for the following query "QUERY_HERE"
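Since the trigger is a fixed phrase with a quoted query, detecting it amounts to a simple pattern match. Below is a minimal sketch of that detection in Python rather than Hyperlambda; the function name and regular expression are my own illustration of the idea, not the actual implementation.

```python
import re

# Hypothetical sketch: the chatbot only searches the web when the prompt
# matches the exact trigger phrase followed by a quoted query.
TRIGGER = re.compile(r'^Find me information for the following query\s+"([^"]+)"\s*$')

def extract_web_query(prompt):
    """Return the quoted query if the prompt asks for a web search, else None."""
    match = TRIGGER.match(prompt.strip())
    return match.group(1) if match else None
```

Anything that doesn't match the trigger is handled as a normal ChatGPT prompt, without touching the internet.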
Now that we've got the semantics out of the way, let's look at some of the things we had to do to increase the quality of this process. Initially, only about 40 to 50 percent of our queries would succeed. Below I will explain why, and what we did to increase this number beyond 90%.
We've probably got the best web scraping technology in the industry, and we've learned a lot as we've been using it over the last 7 months, scraping dozens of websites every single day due to our "get a free AI chatbot" web form. This puts us in a unique position to understand how to create high quality AI training data from websites and all sorts of other sources - And you'd be surprised by how much of "the AI problem" is good old-fashioned software development, with algorithms, architecture, composition, software design, and simple code.
If you want better AI, write better traditional code 😉
Some of our more important findings regarding website scraping are as follows.
Not all websites CAN be scraped
We try to be a "good scraping citizen". By this I mean we clearly identify our spiders as website scrapers, using unique, identifiable HTTP User-Agent headers, and we try our best to respect websites and avoid overloading them as we scrape. More work can be done here, but at least, contrary to most others, we don't "hide" the fact that we're scraping your website.
However, not all websites allow being scraped. Some websites simply shut out all web scrapers they can identify. Some have web application firewalls, preventing anything but "human beings" from accessing and scraping them - Which creates a problem for us as we try to retrieve whatever information we can find at these sites.
The way we solve this is by invoking DuckDuckGo and retrieving the top 5 hits for whatever query the user is searching for. Then we retrieve all of these in parallel, with a timeout of 10 seconds. Why the timeout? Because some sites will "block you from getting data while keeping the socket connection open", implying they will never return. The idea is that unless a site returns its HTML in less than 10 seconds, we release the HTTP connection and simply ignore that URL.
Out of 5 hits from DuckDuckGo, typically 1 or fewer will block. Since we're fetching information in parallel, asynchronously from 5 URLs, we'll still get some information from 2 or 3 websites 98% of the time. And the process as a whole will never take more than 10 seconds due to our timeout. The timeout is crucial for us, since we don't persist data locally, but always fetch it on demand - implying 10 seconds to scrape web pages becomes 10 extra seconds to get your answer from ChatGPT.
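The "fetch everything in parallel, abandon whatever misses the deadline" pattern can be sketched in Python like this. The `fetcher` callable, the placeholder User-Agent string, and the function names are my own assumptions for illustration; the real implementation is the Hyperlambda shown further down.

```python
import concurrent.futures
import urllib.request

def fetch(url, timeout=10):
    """Fetch one URL, honestly identifying ourselves via the User-Agent header.
    The User-Agent value below is a placeholder, not the one the product uses."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "FriendlyScraperBot/1.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

def scrape_all(urls, fetcher=fetch, timeout=10):
    """Fetch every URL in parallel; anything that hasn't returned within
    `timeout` seconds is abandoned, so the operation as a whole never
    delays the user's answer past the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(urls)))
    futures = {pool.submit(fetcher, url): url for url in urls}
    done, _ = concurrent.futures.wait(futures, timeout=timeout)
    results = {}
    for future in done:
        try:
            results[futures[future]] = future.result()
        except Exception:
            pass  # the site blocked us or errored; simply ignore this URL
    pool.shutdown(wait=False, cancel_futures=True)  # don't wait for stragglers
    return results
```

Note the `shutdown(wait=False, cancel_futures=True)`: sites that hold the socket open forever are simply left behind, mirroring the 10-second rule described above.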
Below is the primary entry point code. Even if you don't understand Hyperlambda, you should be able to understand the general idea, and possibly transpile it into your programming language of choice.
```
/*
 * Slot that searches DuckDuckGo for [max] URLs matching the [query],
 * for then to scrape each URL, and aggregating the result
 * returning it back to caller as a single Markdown.
 */
slots.create:magic.http.duckduckgo-and-scrape

   // Sanity checking invocation.
   validators.mandatory:x:@.arguments/*/query
   validators.string:x:@.arguments/*/query
      min:3
      max:250
   validators.integer:x:@.arguments/*/max
      min:1
      max:10

   // Searching DuckDuckGo for matches.
   add:x:+
      get-nodes:x:@.arguments/*
   signal:magic.http.duckduckgo-search

   // Building our execution object that fetches all URLs simultaneously in parallel.
   .exe

      // Waiting for all scraping operations to return.
      join

   for-each:x:@signal/*/result/*

      // Dynamically constructing our lambda object.
      .cur
         fork
            .reference
            try
               unwrap:x:+/*
               signal:magic.http.scrape-url
                  url:x:@.reference/*/url
                  semantics:bool:true
            .catch
               log.error:Could not scrape URL
                  url:x:@.reference/*/url
                  message:x:@.arguments/*/message

      // Adding URL and title as reference to currently iterated [fork].
      unwrap:x:+/*/*
      add:x:@.cur/*/fork/*/.reference
         .
            url:x:@.dp/#/*/url
            title:x:@.dp/#/*/title

      // Adding current thread to above [join].
      add:x:@.exe/*/join
         get-nodes:x:@.cur/*

   // Executing [.exe] retrieving all URLs in parallel.
   eval:x:@.exe

   /*
    * Iterating through each above result,
    * returning result to caller.
    *
    * Notice, we only iterate through invocations that have result, and
    * did not timeout by verifying [signal] slot has children.
    */
   for-each:x:@.exe/*/join/*/fork

      // Verifying currently iterated node has result, containing both prompt and completion.
      if
         exists:x:@.dp/#/*/try/*/signal/*/*/prompt/./*/completion
         .lambda

            // Adding primary return lambda to [return] below.
            unwrap:x:+/*/*/*
            add:x:../*/return
               .
                  .
                     url:x:@.dp/#/*/.reference/*/url
                     title:x:@.dp/#/*/.reference/*/title
                     snippets

            // Adding [snippets] to return below.
            add:x:../*/return/0/-/*/snippets
               get-nodes:x:@.dp/#/*/try/*/signal/*

   // Returning result of invocation to caller.
   return
```
The basic idea is as follows;
- Query DuckDuckGo and scrape the resulting top 5 URLs
- Create one async thread for each result, and retrieve these from their respective URLs, with a timeout of 10 seconds
- Wait for all threads to finish, and create an aggregated result
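The final aggregation step - combining whatever pages survived the timeout into one Markdown document for ChatGPT - can be sketched as follows. The dictionary keys (`title`, `markdown`, `url`) are my own illustration of the shape of each scraped result, not the actual field names.

```python
def aggregate(pages):
    """Combine scraped pages into a single Markdown document, keeping the
    source URL of each page so the chatbot can cite its references."""
    sections = []
    for page in pages:
        sections.append(
            f"## {page['title']}\n\n{page['markdown']}\n\nSource: {page['url']}")
    return "\n\n".join(sections)
```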
There is a lot more code related to this, but since Magic is Open Source, you can study its code for more details. For instance, we do a lot to try our best to create Markdown out of the resulting HTML. This significantly reduces the amount of data we're sending to ChatGPT, while also keeping hyperlinks, images, and lists in their semantic form. This is why our chatbot can display images, hyperlinks, and lists the way it does. This simple fact alone increases the quality of our chatbot by at least one order of magnitude.
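To illustrate why Markdown helps: hyperlinks, images, and list items survive as compact semantic text instead of verbose HTML, while scripts and styles are dropped entirely. Below is a deliberately minimal Python sketch of such a conversion using only the standard library; Magic's actual converter is far more thorough, this only shows the idea.

```python
from html.parser import HTMLParser

class HtmlToMarkdown(HTMLParser):
    """Minimal sketch: keep links, images, and list items in Markdown form,
    drop scripts/styles, and pass ordinary text through."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.href = None
        self.skip = 0  # depth of <script>/<style> nesting

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "a":
            self.href = attrs.get("href", "")
            self.out.append("[")
        elif tag == "img":
            self.out.append(f"![{attrs.get('alt', '')}]({attrs.get('src', '')})")
        elif tag == "li":
            self.out.append("\n- ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip -= 1
        elif tag == "a" and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None
        elif tag in ("p", "ul", "ol"):
            self.out.append("\n")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(data)

def html_to_markdown(html):
    parser = HtmlToMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()
```

A `<p>See <a href="...">this</a></p>` collapses to `See [this](...)` - a fraction of the tokens, with the semantics intact.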
We do NOT STEAL your information
One thing we do differently is that we try our best to always provide sources and references to our users, if we can fit them into the context. This implies the chatbot will typically end its explanation with something like "This information was fetched from the following URLs; abc, xyz".
This is first of all the polite thing to do, and secondly it allows our users to fact check what our chatbots are telling them. The end result is that instead of "stealing traffic from your website", we'd probably instead GIVE your website additional traffic - since users will likely want to fact check their answers by reading the sources DuckDuckGo provides us with.
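Appending those references is straightforward once you've kept the URL and title of each scraped page around. A hedged Python sketch (the function and field names are my own, not the product's):

```python
def append_sources(answer, pages):
    """Append a reference list to the chatbot's answer so users can
    fact check it against the websites the information came from."""
    if not pages:
        return answer
    refs = "\n".join(f"- [{p['title']}]({p['url']})" for p in pages)
    return (answer
            + "\n\nThis information was fetched from the following URLs:\n\n"
            + refs)
```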
This is hard. I remember my former partner saying; "Why should I invest in something anybody can copy and steal?" Well, so far we're the only one in the industry able to do what we're currently doing. We're basically "10 years ahead of the competition", and nobody is able to "copy us" - Even though I do my best to help them copy our ideas every single day, by exclusively innovating openly in the public space, and open source licensing 99% of every single line of code I write 😂
You were wrong, I was right, check. 7 billion 999 million 999 thousand and 999 more to go 😂