More Reliable Web Crawling
Version one of Voltron already let you crawl an impressive number of URLs per second. With this new and improved web scraper (a cliché, we know, but true), you’ll not only be able to crawl more URLs, you’ll also be able to do so more reliably. Arguably the most important feature of MAULER, and the one that makes the increased speed and reliability possible, is the internal structure of the crawler. Previously, crawls were run through external nodes, meaning they could be viewing web pages from international IP addresses, which can, and often did, skew the results. Now every URL is accessed by an internal HTTP request, so we have much more control over what the web scraper sees when it crawls the raw HTML. This also means that what we, and those of you who use Voltron, see in the tester (a Chrome extension we provide) should almost always match what the web crawler actually sees.
In addition to writing our own HTTP class to make web requests, we’ve also created our own extended version of Cheerio, a lightweight implementation of core jQuery features that performs DOM traversal more quickly (up to 8x the speed of JSDOM, according to the Cheerio GitHub page). The project uses all of Cheerio’s functionality, but we created our extended version, aptly named Captain Crunch, to add certain jQuery functions that we commonly use in web scraping, including .each, .not, .makeArray, .filter, and .prop. The code for Captain Crunch will be available soon on our GitHub page.
As mentioned before, MAULER began as a test of whether PhantomJS and Scraper JS would provide us with greater web crawling capabilities. However, we decided to leave these two libraries behind and create our own HTTP class and extended Cheerio library in order to dramatically increase consistency and speed. While experimenting with these two libraries, we found that they required the creation of a window object, which can take up to five valuable seconds. By using our own HTTP class and Captain Crunch, we avoid this step and achieve the faster and more reliable results we wanted.
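To give a rough sense of what working on raw HTML without a window object looks like, here is a minimal sketch. It is not our HTTP class or Captain Crunch (those are JavaScript); it uses Python's standard-library html.parser purely as an illustration, and the names LinkExtractor and extract_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags while streaming through raw HTML.

    Hypothetical illustration only: no browser, no window object, no rendering.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(raw_html):
    """Parse a raw HTML string and return the hrefs found, in document order."""
    parser = LinkExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.links


sample = (
    '<html><body><a href="/a">A</a><p>text</p>'
    '<a href="/b">B</a><a>no href</a></body></html>'
)
print(extract_links(sample))  # prints ['/a', '/b']
```

Because the parser only walks the markup, there is no page environment to build before the first selector runs, which is the startup cost described above.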
Faster Web Crawling
Another step up that MAULER provides is a faster and more dependable transporter. The transporter is the part of Voltron that composes the results files for each crawl, and it used to be somewhat of a bottleneck, as it only allowed one Redis key to write to it at a time. The new transporter, on the other hand, allows multiple keys to push data to Redis at the same time and only starts creating results files once the Redis list has reached 10 MB of data. This allows crawls to run through the entire Voltron system much faster, as other components, such as the crawlers themselves and the URL distributor, no longer have to wait as long for the transporter to finish its work.
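The batching idea can be sketched roughly as follows. This is not the actual transporter: an in-memory list stands in for the Redis list, threads stand in for the multiple keys, and the names BufferedTransporter, push, and results_files are hypothetical. We also assume here that 10 MB means 10 * 1024 * 1024 bytes.

```python
import threading

# Assumption: "10 MB" is taken as 10 * 1024 * 1024 bytes.
FLUSH_THRESHOLD_BYTES = 10 * 1024 * 1024


class BufferedTransporter:
    """In-memory stand-in for the shared list; this does not talk to Redis."""

    def __init__(self, threshold=FLUSH_THRESHOLD_BYTES):
        self.threshold = threshold
        self._lock = threading.Lock()
        self._buffer = []
        self._buffered_bytes = 0
        self.results_files = []  # each entry represents one results file

    def push(self, record):
        """Accept a bytes record from any producer; flush once the threshold is hit."""
        batch = None
        with self._lock:
            self._buffer.append(record)
            self._buffered_bytes += len(record)
            if self._buffered_bytes >= self.threshold:
                # Swap the full buffer out so producers can keep pushing.
                batch = self._buffer
                self._buffer = []
                self._buffered_bytes = 0
        if batch is not None:
            self._write_results_file(batch)

    def _write_results_file(self, batch):
        # A real transporter would write to disk; here we only keep the bytes.
        with self._lock:
            self.results_files.append(b"\n".join(batch))


def producer(transporter, count, size):
    """Simulates one source pushing `count` records of `size` bytes each."""
    for _ in range(count):
        transporter.push(b"x" * size)


# Small threshold so the example runs instantly: 4 producers x 50 records x 100 bytes.
transporter = BufferedTransporter(threshold=1000)
threads = [
    threading.Thread(target=producer, args=(transporter, 50, 100))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(transporter.results_files))  # prints 20
```

Producers only hold the lock long enough to append a record, so none of them waits on another producer's turn, and results files are only built once a full batch has accumulated.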
Faster Bug Fixes
The third and final major new feature of MAULER is the QA environment. This is a feature that our whole team at Datafiniti, and we’re sure you as well, has been eager to have. Because crawls now run internally with this new version of our web scraper, and because it has much more custom error handling, it will be much easier to find bugs (and fix them!), much faster for us to roll out new features, and much simpler to test crawls and investigate issues.
We’re beyond excited to be bringing the improved Voltron web crawler to you and hope you are excited to try it out. Please reach out to us at firstname.lastname@example.org if you have any questions regarding the new crawler and its capabilities or would like a demonstration of our new 80legs product.