The Best Web Scraper in the Industry
Over the last couple of months we've been working on our Hyperlambda generator. The point of this work is that we've taught it to semantically scrape web pages and sites, and to return structured JSON. To see what that means in practice, you can try it out at our Natural Language API.
Examples of prompts you could run include:
- Crawl all URLs from the sitemap at xyz, return all H1 headers, titles, and meta descriptions
- Return all external hyperlinks from xyz
- Find all broken images on the page xyz
- Etc ...
The point is that instead of scraping the whole page and using an LLM to extract JSON, the generator creates deterministic code that executes the same way every single time it runs, and returns only the requested information to the caller. To understand the value proposition, realise that our website's landing page is roughly 8,000 tokens. The output of a Hyperlambda script extracting its H1, title, and description will probably be less than 80 tokens.
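To make the mechanics concrete, here is a minimal sketch of the kind of deterministic extraction such a generated script performs. The real output is Hyperlambda, not Python; the URL below is a placeholder, and `requests` and `BeautifulSoup` are illustrative stand-ins for whatever HTTP and HTML handling the generated code actually uses.

```python
# Illustrative Python equivalent of the deterministic extraction the
# generated code performs; the actual generated Hyperlambda differs.
import requests
from bs4 import BeautifulSoup

def extract_seo_fields(url: str) -> dict:
    """Fetch one page and return only the requested fields as JSON-ready data."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "description"})
    return {
        "url": url,
        "title": soup.title.string.strip() if soup.title and soup.title.string else None,
        "h1": [h.get_text(strip=True) for h in soup.find_all("h1")],
        "description": meta["content"] if meta and meta.has_attr("content") else None,
    }

# A handful of fields comes back instead of the full 8,000-token page.
print(extract_seo_fields("https://example.com"))
```

The important property is that the extraction is plain code: it returns the same handful of fields every time it runs, with no model in the loop at scrape time.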
Implying our web scraper consumes roughly 1% of the tokens an LLM-based scraper requires!
It's NOT about Speed
Don't get me wrong, we also happen to have the fastest LLM in the world; an average request takes about 1 to 4 seconds, and you can try it out for yourself using the above link if you don't believe me. But even that's not its primary feature. Nope, its primary feature is that it enables us to do things that were previously impossible to even imagine.
When you reduce token consumption by 99%, whole new categories of capabilities naturally emerge. For instance, imagine the following prompt:
Scrape xyz's sitemap, then crawl all pages and return the URLs of all hyperlinks on all pages that do not return success
The above is an example of a prompt that would probably do exactly what you think. Using vanilla ChatGPT for this is, first of all, impossible. But even if it were possible, it would consume on average 4,000 tokens multiplied by 450 pages, which becomes 1.8 million tokens. There's no LLM in the world capable of dealing with 1.8 million tokens; the largest models top out at around 1 million. And even if you could get it to work, it would probably hallucinate a lot, and probably also spend hours executing. I've run the above against a 450-page site with our scraper, and the entire job executed in roughly 60 seconds!
Implying you're talking about 1% of the energy consumption, 1% of the cost, and 100 times the speed!
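Again purely as an illustration of the generated logic rather than the actual Hyperlambda, a broken-link check over a sitemap boils down to something like the following sketch; the sitemap URL is a placeholder, and `requests` and `BeautifulSoup` are assumptions standing in for the generated code's own HTTP and HTML handling.

```python
# Illustrative Python sketch of the broken-link check described above;
# the real generated Hyperlambda differs.
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def page_urls_from_sitemap(sitemap_url: str) -> list[str]:
    """Return all <loc> entries from a standard XML sitemap."""
    xml = requests.get(sitemap_url, timeout=10).text
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in ET.fromstring(xml).findall(".//sm:loc", ns)]

def broken_links(sitemap_url: str) -> list[dict]:
    """For every page in the sitemap, return hyperlinks that do not respond with success."""
    failures = []
    for page in page_urls_from_sitemap(sitemap_url):
        soup = BeautifulSoup(requests.get(page, timeout=10).text, "html.parser")
        for a in soup.find_all("a", href=True):
            target = urljoin(page, a["href"])
            try:
                status = requests.head(target, timeout=10, allow_redirects=True).status_code
            except requests.RequestException:
                status = None
            if status is None or status >= 400:
                failures.append({"page": page, "link": target, "status": status})
    return failures

print(broken_links("https://example.com/sitemap.xml"))
```

Because this is ordinary deterministic code, running it over 450 pages costs network round-trips rather than tokens.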
Wrapping up
A partner of ours paid $700 to a Fiverr consultant a couple of months ago to identify broken hyperlinks on a website with 2,000+ pages. The process has been going on for months now, and the consultant is still not done. With our new web scraping capabilities, it's literally a five-minute job. When it comes to SEO and website management, our technology can give you insights that no other tool in the industry can. If you want to try it out for yourselves, you can purchase a cloudlet below.