We briefly touched on how to build a web scraper in our last post on web crawling. In this post, I’ll go into more detail about how to do this. When I use the term “web scraper,” I’m referring to a very specific type of web crawler – one that looks at a specific website and extracts data from it. We do a lot of similar scraping for Datafiniti.
This post is going to cover a lot of ground, including:
- Document Object Model (DOM): An object representation of HTML
- Setting up our environment
- Building the Scraper: Building out the scraper attribute-by-attribute
- Running the Scraper: Using 80legs to run the scraper
The Document Object Model
Before we dive into building a scraper, you’ll need to understand a very important concept – the Document Object Model, aka the DOM. The DOM is how all modern web browsers look at the HTML that makes up a web page. The browser reads in the HTML and converts it to a more formalized data structure, which it then uses to render the content you actually see on the site. You can think of the DOM as a nested collection of HTML data, and you can even see it in your browser. In Chrome, you get to it by right-clicking and choosing “Inspect Element”:
As an example, let’s say we wanted to capture all the most-nested elements in this HTML list (item-1, item-2, and item-3):
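The original snippet isn’t reproduced here, but a nested list of the kind we’re describing looks something like this (the class names are assumed to match the `ul.level-3` selector used in this example):

```html
<ul class="level-1">
  <li>
    <ul class="level-2">
      <li>
        <ul class="level-3">
          <li>item-1</li>
          <li>item-2</li>
          <li>item-3</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
```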
With jQuery, we would just need to do something like this:
var innerList = $html.find('ul.level-3 li');
As you’ll see, using jQuery with the DOM greatly simplifies the web scraping process.
Setting Up Our Development Environment
Now that we understand some of the basic concepts, we’re almost ready to start building our scraper. Before we can get to the fun stuff, however, we need to set up a development environment. If you do this, you’ll be able to follow along and build the scraper as you read the article. Here are the steps you need to take:
- Install Git.
- Clone the EightyApps repo.
- Install the EightyApp tester for Chrome. Instructions are on the EightyApps repo page.
- Register on 80legs.
Building the Web Scraper
Now we’re ready to get started! Open the BlankScraper.js file, which should be in the repo you just cloned, in a text editor. In your browser, open http://www.houzz.com/pro/jeff-halper/exterior-worlds-landscaping-and-design, which we’ll use as an example.
For the purposes of this tutorial, we’ll say we’re interested in collecting the following attributes:
- Address
- City
- State
- Postal code
- Contact name
Let’s start with the address. If you right-click on the web page in your browser and select “Inspect Element”, you’ll see the full HTML for the page. Find where the address (“1717 Oak Tree Drive”) appears in the HTML. A quick way to do this is to click the magnifying glass in the upper-left corner of the inspect-element box and then click on the address where it’s displayed on the web page.
Note that the address value is stored within a span tag, which has an itemprop value of “streetAddress”. In jQuery, we can easily capture this value like so:
object.address = $html.find('span[itemprop="streetAddress"]').text();
We can do similar things for city, state, and zip code:
object.city = $html.find('span[itemprop="addressLocality"]').text();
object.state = $html.find('span[itemprop="addressRegion"]').text();
object.postalcode = $html.find('span[itemprop="postalCode"]').text();
Some attributes may be a little harder to get at than others.
Take a look at how the contact for this business (“Jeffrey Halper”) is stored in the HTML. There isn’t really a unique HTML tag for it. It’s stored in a non-unique <dt class="value"> tag. Fortunately, jQuery still gives us the tools to find this tag:
object.contact = $html.find('dt:contains("Contact:")').next().text();
This code finds the dt tag containing the text “Contact:”, traverses to the next HTML tag, and captures the text in that tag.
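Based on the description above, the relevant markup looks roughly like this (a reconstruction for illustration, not the exact Houzz source):

```html
<dl>
  <dt>Contact: </dt>
  <dt class="value">Jeffrey Halper</dt>
</dl>
```

Because class="value" appears on many tags across the page, the :contains() selector on the neighboring label text is what lets us pin down the right one.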
Once we’ve built everything out, here’s what the code for the scraper looks like:
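Below is a minimal sketch of the assembled scraper, combining the snippets above. The processDocument and parseLinks method names match the EightyApp format from the EightyApps repo, but the exact signatures and the $html wrapper shown here are simplifying assumptions, and the URL filters in parseLinks are guesses based on the example URLs in this post:

```javascript
// Sketch of the final scraper in EightyApp style. Signatures are assumed:
// processDocument extracts data from one page; parseLinks returns the next
// URLs to crawl.
var HouzzScraper = function () {
  // Extract the business attributes from one profile page's parsed HTML.
  this.processDocument = function ($html, url) {
    var object = {};
    object.address = $html.find('span[itemprop="streetAddress"]').text();
    object.city = $html.find('span[itemprop="addressLocality"]').text();
    object.state = $html.find('span[itemprop="addressRegion"]').text();
    object.postalcode = $html.find('span[itemprop="postalCode"]').text();
    object.contact = $html.find('dt:contains("Contact:")').next().text();
    return JSON.stringify(object);
  };

  // Keep the crawl focused: only follow links that look like listing or
  // profile pages ("/pro/" and "/professionals/" are assumptions based on
  // the example URLs used in this post).
  this.parseLinks = function ($html, url) {
    var links = [];
    $html.find('a').each(function (i, el) {
      var href = $(el).attr('href');
      if (href && (href.indexOf('/pro/') !== -1 ||
                   href.indexOf('/professionals/') !== -1)) {
        links.push(href);
      }
    });
    return links;
  };
};
```

Returning a JSON string from processDocument keeps each page’s result self-contained, so 80legs can collect one record per URL crawled.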
You’ll notice that there are only two methods in this code. The first is processDocument, which contains all the logic needed to extract data or content from the web page. The second is parseLinks, which grabs the next set of links to crawl from the page the crawler is currently on. I’ve filled out parseLinks to make the crawl more efficient. While we could let the code return every link found, what I’ve provided here keeps the crawl focused on URLs that actually have the data we want to scrape.
You can use the EightyAppTester extension in Chrome to test the code you’ve written. Just paste in your code along with different URLs to see what it extracts from each one.
You may be wondering where the rest of the web crawling logic is. Because we’re going to use 80legs to run this scraper, we don’t need to worry about anything except processDocument and parseLinks. 80legs handles the rest for us; we just specify what to do on each URL the crawl hits. This really cuts down the amount of code we have to write for the web scraper.
Running the Scraper
With our scraping code complete, we head over to 80legs, log in, and upload the code. We’ll also want to upload a URL list so our crawler has at least one URL to start from. For this example, http://www.houzz.com/professionals/ is a good start.
With our code and URL list available in our 80legs account, all that’s left is to run the crawl. We can use the Create a Crawl form to select all the necessary settings, and we’re off!
The crawl will take some time to run. Once it’s done, we’ll get one or more result files containing our scraped data. And that’s it!
If you found this post useful, please let us know! If anything was confusing, please comment, and we’ll do our best to clarify.
More posts will be coming, so check back regularly. You can also review our previous posts to get more background information on web crawlers and scrapers.