Web crawling can be a complicated, technical subject to understand. Every web page on the Internet is different from the next, which means every web crawler is also different, at least in some way, from the next.
We do a lot of web crawling to collect the data you see in Datafiniti. In order to help our users get a better understanding of how this process works, we’re embarking on an extensive series of posts to provide better insight into what a web crawler is, how it works, how it can be used, and the challenges involved.
Here are the posts we have planned:
- What is Web Crawling?
- Typical use cases for web crawlers
- Different data formats for storing data from a web crawl: CSV, JSON, and Databases
- Techniques for scraping data
- How is a web crawler different from a search engine?
- Making sure a web crawler behaves well
- How to use JSON data
- Challenges with scraping data
- Web crawling use cases: collecting pricing data
- Web crawling use cases: collecting business reviews
- Web crawling use cases: collecting product reviews
- Comparison of different web crawlers
So let’s get started!
The Web Page, Deconstructed
We actually need to define what a web page is before we can really understand how a web crawler works. A lot of people think of a web page as what they see in their browser window. That's true, but it's not what a web page looks like to a web crawler. So let's look at a web page the way a web crawler does.
When you see http://www.cnn.com, you see something like this:
In fact, what you are seeing is the combination of many different “resources”, which your web browser is combining together to show you the page you see. Here’s an abridged version of what happens:
- You type in “http://www.cnn.com”.
- Your browser says ok, let me GET “http://www.cnn.com”.
- CNN’s server says, hey browser, here’s the content for that page. At this point, the browser has only received the HTML source code of “http://www.cnn.com”, which looks something like this:
- Your browser looks through this code and notices a few things. It notices there are a few style resources needed. It also notices there are several image resources needed.
- The browser now says, I need to GET all of these resources as well.
- Once all the resources for the page are received, it combines them all and displays the page you see.
This is what your browser does. A web crawler can fetch all the same resources, but if you tell it to GET “http://www.cnn.com”, it will only fetch the HTML source code. That’s all it knows about the page until you tell it to do something else (possibly with the information in that HTML). By the way, “GET” is the actual technical term for the type of request made by both the crawler and your browser.
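To make the idea of a GET request concrete, here's a small sketch in Python that builds the raw request text a crawler (or browser) would send for http://www.cnn.com. This only constructs the text of the request; it doesn't open a real network connection.

```python
def build_get_request(host: str, path: str = "/") -> str:
    """Build the raw HTTP/1.1 GET request text a client would send."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

# The first line names the request type (GET), the path, and the protocol.
print(build_get_request("www.cnn.com"))
```

The server's response to this request is the HTML source code described above, and nothing more.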
A Very Basic Web Crawler
Alright, so now that we understand that requesting “http://www.cnn.com” will only return HTML source code, let’s see what we can do with that.
Let’s imagine our web crawler as a little app. When you start this app, it asks you for what web page you want to crawl. That’s its only input: a list of URLs, or in this case, a list containing 1 URL.
You enter “http://www.cnn.com”. At this point, the web crawler gets the HTML source code of this URL. The HTML is like a very long piece of semi-structured text. It’s going to write that text to a separate file. Just to make it easy on us, the web crawler will also write which URL belongs to this source code.
The whole thing can be visualized like this:
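In code, that little app might look something like the sketch below, using Python's standard library. The function names here (`fetch_html`, `crawl_one`) are just illustrative choices, not part of any particular crawler.

```python
from urllib.request import urlopen

def fetch_html(url: str) -> str:
    """GET a URL and return its HTML source code as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def crawl_one(url: str, out_path: str, fetch=fetch_html) -> None:
    """Fetch one URL and write the URL plus its HTML source to a file."""
    html = fetch(url)
    with open(out_path, "w", encoding="utf-8") as f:
        # Record which URL this source code belongs to, then the source itself.
        f.write(f"URL: {url}\n")
        f.write(html)

# Example (would make a real network request):
# crawl_one("http://www.cnn.com", "crawl_output.txt")
```

The `fetch` parameter just makes it easy to swap in a different fetching function, which also makes the sketch easy to test without touching the network.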
A Slightly More Complicated Web Crawler
So the web crawler can’t do much right now, but it can do the basic thing any web crawler needs to do, which is to get content from a URL. Now we need to expand it to get more than 1 URL.
There are two ways we can do this. First, we can supply more than 1 URL in our URL list as input. The web crawler would then iterate through each URL in this list, and write all the data to the same log file, like so:
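Extending the sketch to handle a list of URLs is a small change: loop over the list and append each page's URL and HTML source to the same log file. As before, `fetch_html` is just an illustrative stand-in for whatever function fetches a page.

```python
from urllib.request import urlopen

def fetch_html(url: str) -> str:
    """GET a URL and return its HTML source code as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def crawl_all(urls, out_path: str, fetch=fetch_html) -> None:
    """Fetch each URL in the list and log it to one shared output file."""
    with open(out_path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(f"URL: {url}\n")
            f.write(fetch(url))
            f.write("\n")
```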
Another way would be to use the HTML source code from each URL to find the next set of URLs to crawl. If you look at the HTML source code for any page, you’ll find several anchor tags, which look like `<a href="...">some text</a>`. These are the links you see on a web page, and they tell the web crawler where other URLs are.
So all we need to do now is extract the URLs of those links and then feed those in as a new URL list to the app, like so:
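One way to do that extraction is with Python's built-in `HTMLParser`, which calls a method for every tag it encounters. This sketch collects the `href` value of every anchor tag; the resulting list can be fed back in as the next URL list.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<a href="http://www.cnn.com/world">World</a> <a href="/us">US</a>')
print(extractor.links)  # ['http://www.cnn.com/world', '/us']
```

Note that some links (like `/us` above) are relative, so a real crawler would also need to resolve them against the page's URL before fetching them.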
In fact, this is how web crawlers for search engines typically work. They start with a list of “top-level domains” (e.g., cnn.com, facebook.com, etc.) as their URL list, step through that list, and then crawl to all the links found on the pages they crawl.
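The seed-list-plus-follow-links pattern described above can be sketched as a simple breadth-first loop. Here `fetch` and `extract_links` stand in for the fetching and link-extraction steps; the page limit is just a safety cap, since following every link would otherwise never terminate.

```python
def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Starting from seed URLs, fetch pages and follow their links."""
    to_visit = list(seed_urls)
    visited = set()
    pages = {}
    while to_visit and len(pages) < max_pages:
        url = to_visit.pop(0)          # breadth-first: oldest URL first
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html):
            if link not in visited:
                to_visit.append(link)  # queue newly discovered URLs
    return pages
```

Real search-engine crawlers add a lot on top of this (politeness delays, deduplication, prioritization), but the core loop is the same.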
So What’s the Purpose of the Web Crawler?
We now have a conceptual understanding of what a typical web crawler does, but it may not be clear what its real purpose is.
The ultimate purpose of any web crawler is to collect content or data from the web. “Content” or “data” can mean a wide variety of things, from the full HTML source code of every URL requested to a simple yes/no indicating whether a specific keyword appears on a page. In our next blog post, we’ll cover some common use cases and expand upon how the conceptual “web crawling app” we’ve described here could be adapted to fit them.
Want to Try Web Crawling Yourself?
If you’re interested in running your own web crawls, we recommend using 80legs. It’s the same platform we use to run crawls for Datafiniti.