Upcoming Downtime

80legs will be unavailable on Monday, September 22, from 10 am to 4 pm central US time (GMT-6). This downtime will help us deploy a major update to our back-end infrastructure, which will significantly improve crawling performance.

Please note that all crawls still running at 10 am (GMT-6) September 22 will be canceled. You will be able to run new crawls after 4 pm (GMT-6) September 22.

The update will provide the following benefits:

  1. More consistent crawling speeds – no more slow periods.
  2. More reliable crawling performance – URLs will be more consistently crawled.
  3. Better internal visibility – we’re deploying an internal QA infrastructure that will give us more tools to debug and improve 80legs.


Building a Web Scraper

We briefly touched on how to build a web scraper in our last post on web crawling.  In this post, I’ll go into more detail about how to do this.  When I use the term “web scraper,” I’m referring to a very specific type of web crawler – one that looks at a specific website and extracts data from it.  We do a lot of similar scraping for Datafiniti.

This post is going to cover a lot of ground, including:

  1. Document Object Model (DOM):  An object representation of HTML
  2. jQuery:  A JavaScript library that will help you manipulate the DOM
  3. Setting up our environment
  4. Building the Scraper:  Building out the scraper attribute-by-attribute
  5. Running the Scraper:  Using 80legs to run the scraper

The Document Object Model

Before we dive into building a scraper, you’ll need to understand a very important concept – the Document Object Model, aka the DOM.  The DOM is how all modern web browsers look at the HTML that makes up a web page.  The browser reads in the HTML and converts it to a more formalized data structure, which it then uses to render the content you actually see on the site.  You can think of the DOM as a nested collection of HTML data, and you can even see it in your browser.  In Chrome, right-click and choose “Inspect Element”:

[Screenshot: Chrome's Inspect Element panel showing the DOM]
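
If you want to poke at the DOM directly, you can also use the browser's JavaScript console; for example, on any page:

// Run these in the DevTools console; the DOM is exposed as `document`.
document.title;                          // the text of the page's <title> element
document.querySelectorAll('a').length;   // how many anchor (link) elements the DOM contains
document.body.children;                  // the nested elements one level below <body>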

jQuery

Because the DOM is such an accepted, standardized way of working with HTML, there are a lot of tools available for manipulating it.  One of the most widely used is jQuery, a library that extends JavaScript with a ton of DOM-manipulation functionality.

As an example, let’s say we wanted to capture all the most-nested elements in this HTML list (item-1, item-2, and item-3):

<ul class="level-1">
 <li class="item-i">I</li>
 <li class="item-ii">II
  <ul class="level-2">
   <li class="item-a">A</li>
   <li class="item-b">B
    <ul class="level-3">
     <li class="item-1">1</li>
     <li class="item-2">2</li>
     <li class="item-3">3</li>
    </ul>
   </li>
   <li class="item-c">C</li>
  </ul>
 </li>
 <li class="item-iii">III</li>
</ul>

With jQuery, we would just need to do something like this:

var innerList = $html.find('ul.level-3 li');

As you’ll see, using JQuery with the DOM greatly simplifies the web scraping process.
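
The matched set then behaves like any other jQuery collection; for example (assuming $ is the global jQuery function):

innerList.each(function() {
  console.log($(this).text());   // logs "1", "2", "3"
});
console.log(innerList.length);   // 3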

Setting Up Our Development Environment

Now that we understand some of the basic concepts, we’re almost ready to start building our scraper.  Before we can get to the fun stuff, however, we need to set up a development environment.  If you do this, you’ll be able to follow along and build the scraper as you read the article.  Here are the steps you need to take:

  1. Install Git.
  2. Clone the EightyApps repo.
  3. Install the EightyApp tester for Chrome.  Instructions are on the EightyApps repo page.
  4. Register on 80legs.

Building the Web Scraper

Now we’re ready to get started!  Open the BlankScraper.js file, which should be in the repo you just cloned, in a text editor.  In your browser, open http://www.houzz.com/pro/jeff-halper/exterior-worlds-landscaping-and-design, which we’ll use as an example.

For the purposes of this tutorial, we’ll say we’re interested in collecting the following attributes:

  • Name
  • Address
  • City
  • State
  • Postal code
  • Contact

Let’s start with the address.  If you right-click on the web page in your browser and select “View Source”, you’ll see the full HTML for the page.  Find where the address (“1717 Oak Tree Drive”) appears in the HTML.  A quicker way is to open the Inspect Element panel, click the magnifying-glass icon in its upper-left corner, and then click the address where it’s displayed on the web page.

Note that the address value is stored within a span tag, which has an itemprop value of “streetAddress”.  In jQuery, we can easily capture this value like so:

object.address = $html.find('span[itemprop="streetAddress"]').text();

We can do similar things for city, state, and zip code:

object.city = $html.find('span[itemprop="addressLocality"]').text();
object.state = $html.find('span[itemprop="addressRegion"]').text();
object.postalcode = $html.find('span[itemprop="postalCode"]').text();

Some attributes may be a little harder to get at than others.

Take a look at how the contact for this business (“Jeffrey Halper”) is stored in the HTML.  There isn’t really a unique HTML tag for it; it sits in a non-unique <dt class="value"> tag.  Fortunately, jQuery still gives us the tools to find it:

object.contact = $html.find('dt:contains("Contact:")').next().text();

This code finds the <dt> element containing the text “Contact:”, traverses to the next HTML tag, and captures the text inside that tag.

Once we’ve built everything out, here’s what the code for the scraper looks like:
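
Something along these lines, assuming the standard EightyApps template (the method signatures and helper names like parseHtml() and makeLink() come from that template and may differ slightly; the name selector and the link filter are illustrative):

var EightyApp = function() {
  // processDocument: extract the data we want from a single crawled page.
  this.processDocument = function(html, url, headers, status, jQuery) {
    var app = this;
    var $html = app.parseHtml(html, jQuery); // wrap the raw HTML in a jQuery object
    var object = {};

    object.name = $html.find('a.pro-title').text(); // selector for the name is illustrative
    object.address = $html.find('span[itemprop="streetAddress"]').text();
    object.city = $html.find('span[itemprop="addressLocality"]').text();
    object.state = $html.find('span[itemprop="addressRegion"]').text();
    object.postalcode = $html.find('span[itemprop="postalCode"]').text();
    object.contact = $html.find('dt:contains("Contact:")').next().text();

    return JSON.stringify(object); // each crawled URL adds one JSON object to the result file
  };

  // parseLinks: decide which links on this page the crawl should follow next.
  this.parseLinks = function(html, url, headers, status, jQuery) {
    var app = this;
    var $html = app.parseHtml(html, jQuery);
    var links = [];

    // Only follow links that look like profile or directory pages, so the crawl
    // stays focused on pages that actually contain the data we want.
    $html.find('a').each(function() {
      var link = jQuery(this).attr('href');
      if (link && (link.indexOf('/pro/') !== -1 || link.indexOf('/professionals/') !== -1)) {
        links.push(app.makeLink(url, link)); // makeLink() resolves relative URLs in the template
      }
    });
    return links;
  };
};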

You’ll notice that there are only two methods in this code.  The first, processDocument, contains all the logic needed to extract data or content from the web page.  The second, parseLinks, grabs the next set of links to crawl from the page the crawler is currently on.  I’ve filled out parseLinks to make the crawl more efficient.  While we could let the code return every link found, the version here focuses the crawl on URLs that actually have the data we want to scrape.

You can use the EightyAppTester extension in Chrome to test the code you’ve written.  Just paste the code in, then try different URLs to see what it grabs from each one.

You may be wondering where the rest of the web crawling logic is.  Because we’re going to use 80legs to run this scraper, we don’t need to worry about anything except processDocument and parseLinks.  80legs will handle the rest for us.  We just define what to do on each URL the crawl hits.  This really cuts down the amount of code we have to write for the web scraper.

Running the Scraper

With our scraping code complete, we head over to 80legs, log in, and upload the code.  We’ll also want to upload a URL list so our crawler has at least one URL to start from.  For this example, http://www.houzz.com/professionals/ is a good start.

With our code and URL list available in our 80legs account, all that’s left is to run the crawl.  We can use the Create a Crawl form to select all the necessary settings, and we’re off!

[Screenshot: the Create a Crawl form on 80legs]

The crawl will take some time to run.  Once it’s done, we’ll get one or more result files containing our scraped data.  And that’s it!

Wrapping Up

If you found this post useful, please let us know!  If anything was confusing, please comment, and we’ll do our best to clarify.

More posts will be coming in, so check back regularly.  You can also review our previous posts to get more background information on web crawlers and scrapers.


Typical Uses For Web Crawlers

In our last post, we provided an introduction to the structure and basic operations of a web crawler.  In this post, we’ll go into more detail on specific use cases for web crawlers.  As we do, we’ll provide some insight into how you could design web crawlers to support each of these use cases.

The One You Use But Don’t Realize It – Search Engines

How terrible would the Internet be without search engines?  Search engines make the Internet accessible to everyone, and web crawlers play a critical part in making that happen.  Unfortunately, many people confuse the two, thinking web crawlers are search engines, and vice versa.  In fact, a web crawler is just the first part of the process that makes a search engine do what it does.

Here’s the whole process:

[Diagram: How Search Engines Work]

When you search for something in Google, Google does not run a web crawler right then and there to find all the web pages containing your search keywords.  Instead, Google has already run millions of web crawls, scraped all the content, stored it, and scored it, so it can display search results instantly.

So how do those millions of web crawls run by Google work?  They’re pretty simple, actually.  Google starts with a small set of URLs it already knows about and stores these as a URL list.  It sets up a crawl to go over this list and extract the keywords and links on each URL it crawls.  As each link is found, those URLs are crawled as well, and the crawl keeps going until some stopping condition is reached.

[Diagram: How Web Crawlers Work]

In our previous post, we described a web crawler that extracted links from each URL crawled to feed back into the crawl.  The same thing is happening here, but now the “Link Extraction App” is replaced with a “Link and Keyword Extraction App”.  The log file will now contain a list of URLs crawled, along with a list of keywords on each of those URLs.

If you wanted to do this same thing on 80legs, you would just need to use the “LinksAndKeywords” 80app with your crawl.  Source code for this app is available here.
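
As a rough illustration of what an app like that does on each page (a simplified sketch, not the actual LinksAndKeywords source):

// Given a page's HTML text, pull out the links and a crude keyword list.
function extractLinksAndKeywords(html) {
  // Grab href values out of anchor tags (crude, but fine for a sketch).
  var links = [];
  var linkPattern = /href="([^"#]+)"/g;
  var match;
  while ((match = linkPattern.exec(html)) !== null) {
    links.push(match[1]);
  }

  // Strip the tags and split the remaining text into rough keywords.
  var keywords = html
    .replace(/<[^>]+>/g, ' ')
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(function(word) { return word.length > 2; });

  return { links: links, keywords: keywords };
}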

The process for storing the links and keywords in a database and scoring the relevancy so search results can be returned is beyond the scope of our post, but if you’re interested, check out these pages:

The One Developers Love – Scraping Data

If we focus our crawling on a specific website, we can build out a web crawler that scrapes content or data from that website.  This can be useful for pulling structured data from a website, which can then be used for all sorts of interesting analysis.

When building a crawler that scrapes data from a single website, we can provide very exact specifications.  We do this by telling our web crawler app specifically where to look for the data we want.  Let’s look at an example.

Let’s say we want to get some data from this website:

Buckingham Floor Company, Doylestown, PA, US 18914

We want to get the address of this business (and any other business listed on this site).  If we look at the HTML for this listing, it looks like this:

[Screenshot: HTML source for the business listing]

Notice the <span itemprop="streetAddress"> tag.  This is the HTML element that contains the address.  If we looked at the other listings on this site, we’d see that the address is always captured in this tag.  So what we want to do is configure our web crawler app to capture the text inside this element.

You can do this capturing in a lot of different ways.  The apps you use with 80legs are developed in JavaScript, which means you can use jQuery to access the HTML as if it were one big data “object” (called the “DOM”).  In a later post, we’ll go into more detail on the DOM so you can get more familiar with it.  In this case, we would just do a simple command like:

object.address = $html.find('span[itemprop="streetAddress"]').text();

We can do similar commands for all the other bits of data we’d want to scrape on this web page, and all of the other pages on the website.  Once we do that, we’d get an object like this for each page appearing in our log file:

{
  object.name: "Buckingham Floor Company",
  object.address: "415 East Butler Avenue",
  object.locality: "Doylestown",
  object.region: "PA",
  object.postalcode: "18914",
  object.phone: "(215) 230-5399",
  object.website: "http://www.buckinghamfloor.com"
}

After we generated this log file and downloaded it to our own database or application, we could start analyzing the data contained within.

Any other sort of data scraping will work the same way.  The process will always be:

  1. Identify the HTML elements containing the data you want.
  2. Build out a web crawler app that captures those elements (80legs makes this easy).
  3. Run your crawl with this app and generate a log file containing the data.

We’ll go into more detail on building a full scraper in a future post, but if you want to give it a go now, check out our support page to see how you can do this with 80legs.

As a final note, if you’re interested in business data, we already make this available through Datafiniti.  If you don’t want to bother with scraping data yourself, we already do it for you!


What is Web Crawling?

Introduction

Web crawling can be a very complicated and technical subject to understand.  Every web page on the Internet is different from the next, which means every web crawler is different (at least in some way) from the next.

We do a lot of web crawling to collect the data you see in Datafiniti.  In order to help our users get a better understanding of how this process works, we’re embarking on an extensive series of posts to provide better insight into what a web crawler is, how it works, how it can be used, and the challenges involved.

Here are the posts we have planned:

  1. What is Web Crawling?
  2. Typical use cases for web crawlers
  3. Building a web scraper
  4. Different data formats for storing data from a web crawl: CSV, JSON, and Databases
  5. How to use JSON data
  6. Challenges with scraping data
  7. Web crawling use cases: collecting pricing data
  8. Web crawling use cases: collecting business reviews
  9. Web crawling use cases: collecting product reviews
  10. Comparison of different web crawlers

So let’s get started!

The Web Page, Deconstructed

We actually need to define what a web page is before we can really understand how a web crawler works.  A lot of people think of a web page as what they see in their browser window.  That’s right as far as it goes, but it’s not what a web page looks like to a web crawler.  So let’s look at a web page the way a web crawler sees it.

When you visit http://www.cnn.com, you see something like this:

[Screenshot: the CNN.com homepage]

In fact, what you are seeing is the combination of many different “resources”, which your web browser is combining together to show you the page you see.  Here’s an abridged version of what happens:

  1. You type in “http://www.cnn.com”.
  2. Your browser says ok, let me GET “http://www.cnn.com”.
  3. CNN’s server says, hey browser, here’s the content for that page.  At this point, the server is only returning the HTML source code of “http://www.cnn.com”, which looks something like this:
    [Screenshot: raw HTML source of cnn.com]
  4. Your browser looks through this code and notices a few things.  It notices there are a few style resources needed.  It also notices there are several image resources needed.
  5. The browser now says, I need to GET all of these resources as well.
  6. Once all the resources for the page are received, it combines them all and displays the page you see.

This is what your browser does.  A web crawler can get all the same resources, but if you tell it to GET “http://www.cnn.com”, it will only fetch the HTML source code.  That’s all it knows about the page until you tell it to do something else (possibly with the information in the HTML).  By the way, “GET” is the actual technical term for the type of request made by both the crawler and your browser.
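
You can see this for yourself with a single GET from code; all that comes back is the raw HTML text.  This sketch assumes an environment with a built-in fetch, such as a modern browser or Node 18+:

// A GET request returns only the HTML source; no images, stylesheets, or scripts come with it.
fetch('http://www.cnn.com')
  .then(function(response) { return response.text(); })
  .then(function(html) {
    console.log(html.slice(0, 500)); // the first few hundred characters of raw HTML
  });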

A Very Basic Web Crawler

Alright, so now that we understand that requesting “http://www.cnn.com” will only return HTML source code, let’s see what we can do with that.

Let’s imagine our web crawler as a little app.  When you start this app, it asks you for what web page you want to crawl.  That’s its only input: a list of URLs, or in this case, a list containing 1 URL.

You enter “http://www.cnn.com”.  At this point, the web crawler gets the HTML source code of this URL.  The HTML is just a very long piece of semi-structured text.  The crawler writes that text to a separate file and, to make it easy on us, also records which URL this source code belongs to.

The whole thing can be visualized like this:

What is Web Crawling Illustration 1
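
Here's a minimal sketch of that little app in Node.js (assuming Node 18+ for the built-in fetch; the log format and file name are just illustrative):

const fs = require('fs');

// Take a URL list (here, a list of one), GET each page, and append the URL
// plus its HTML source to a log file.
async function crawl(urlList, logFile) {
  for (const url of urlList) {
    const response = await fetch(url);
    const html = await response.text();
    fs.appendFileSync(logFile, JSON.stringify({ url: url, source: html }) + '\n');
  }
}

crawl(['http://www.cnn.com'], 'crawl-log.jsonl').catch(console.error);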

A Slightly More Complicated Web Crawler

So the web crawler can’t do much right now, but it can do the basic thing any web crawler needs to do, which is to get content from a URL.  Now we need to expand it to get more than 1 URL.

There are two ways we can do this.  First, we can supply more than 1 URL in our URL list as input.  The web crawler would then iterate through each URL in this list, and write all the data to the same log file, like so:

What is Web Crawling Illustration 2

Another way would be to use the HTML source code from each URL as a way to find the next set of URLs to crawl.  If you look at the HTML source code for any page, you’ll find several references to anchor tags, which look like <a href="">some text</a>.  These are the links you see on a web page, and they can tell the web crawler where other URLs are.

So all we need to do now is extract the URLs of those links and then feed those in as a new URL list to the app, like so:

What is Web Crawling Illustration 3
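
Extending the sketch above, a link-following version keeps a queue: each page's anchor hrefs are fed back into the list until a stopping condition (here, a simple page budget) is hit.  The href extraction below is deliberately crude; a real crawler would parse the DOM and resolve relative URLs:

const fs = require('fs');

async function crawlAndFollow(startUrls, logFile, maxPages) {
  const queue = [...startUrls];
  const seen = new Set(startUrls);
  let crawled = 0;

  while (queue.length > 0 && crawled < maxPages) { // stopping condition: page budget
    const url = queue.shift();
    const html = await (await fetch(url)).text();
    fs.appendFileSync(logFile, JSON.stringify({ url: url, source: html }) + '\n');
    crawled++;

    // Find absolute links in the HTML and feed them back into the queue.
    for (const match of html.matchAll(/href="(https?:\/\/[^"#]+)"/g)) {
      if (!seen.has(match[1])) {
        seen.add(match[1]);
        queue.push(match[1]);
      }
    }
  }
}

crawlAndFollow(['http://www.cnn.com'], 'crawl-log.jsonl', 100).catch(console.error);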

In fact, this is how web crawlers for search engines typically work.  They start with a list of “top-level domains” (e.g., cnn.com, facebook.com, etc.) as their URL list, step through that list, and then follow all the links found on the pages they crawl.

So What’s the Purpose of the Web Crawler?

We now have a conceptual understanding of what a typical web crawler does, but it may not be clear what its real purpose is.

The ultimate purpose of any web crawler is to collect content or data from the web.  “Content” or “data” can mean a wide variety of things, from the full HTML source code of every URL requested to something as simple as a yes/no flag indicating whether a specific keyword exists on a page.  In our next blog post, we’ll cover some common use cases and show how the conceptual “web crawling app” we’ve described here could be extended to fit them.

Want to Try Web Crawling Yourself?

If you’re interested in trying to run your own web crawls, we recommend using 80legs.  It’s the same platform we use to run crawls for Datafiniti.

 


New User Agent for 80legs

On Thursday, July 17th, we’ll be changing the user-agent for the 80legs crawler from “008” to “voltron”.

We recognize that changing the user-agent for our web crawler could potentially be controversial, but in this case we feel it’s strongly warranted.  Over 4 months ago, we launched a completely new back-end for 80legs.  Although we still call the system “80legs”, in reality it’s a completely different web crawler.  One of the biggest features of the new crawler is that it’s considerably better about crawling websites respectfully.  In fact, we haven’t received a single complaint from webmasters since we launched the new crawler.

With this change, the 80legs crawler will now only obey robots.txt directives for the “voltron” user-agent.  It will ignore directives for the “008” user-agent.  We feel this change in behavior is appropriate, as it gives our users the chance to crawl websites inaccessible to the old crawler while still giving webmasters the opportunity to control traffic coming from the new crawler.


Quality That Scales

Seeing the Wall Before It Hits

As we begin to grow the volume of data coming into Datafiniti, data quality is becoming an increasingly important part of our operations. With over 1 million records coming in every day, making quality control (QC) automated is critical.

We recognized this need and the challenges it presented earlier this year. At that time, our data team met to discuss each person’s “ideal” QC platform. We identified the following characteristics as being absolutely essential:

  1. The platform should let a developer run fixes to highly-targeted sections or broad cross-sections of the data.
  2. If a developer implemented new QC logic or updated existing logic, that work should be applied to the entire system. No writing code twice.
  3. Any developer on our team should be able to work on the platform. Easy setup, testing, and deployment was a must.

Ultimately, the goal is scale. Not scale in the sense of the amount of data we can look at. That’s already been done. We needed scale in the sense of how our developers work. With dozens of attributes for millions of records, building out data QC for everything was always going to be hard. We needed to do whatever we could to make it easy on us.

A Vision Realized

Over the next 3 months, our team started building out this idea of a scalable QC platform. It’s now July, and we’re incredibly excited to start rolling out this platform to address and dramatically improve the quality of the data you see in Datafiniti.

This new QC platform addresses all of the goals outlined above. We use a single “base” application that serves as an integrator between a set of QC modules and other aspects of our data operations. Each QC module acts as a set of instructions to validate and fix any issues with a single attribute. So for example, there’s a module for business addresses, one for product names, and so on. Our developers work on individual modules, and “plug” them into the base application. When this happens, everything else in our data pipeline uses this QC logic. Any new QC projects will use it, our import will use it, and even random scripts can use it.
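
In code terms, the shape looks roughly like this (a JavaScript sketch; the module interface and the address-fixing logic are illustrative, not our actual internals):

// Each QC module knows how to validate and fix exactly one attribute.
const qcModules = {};

function registerModule(attribute, module) {
  qcModules[attribute] = module; // "plug" the module into the base application
}

// The base application runs every registered module over a record, so any new or
// updated module immediately applies everywhere this function is used.
function runQC(record) {
  for (const [attribute, module] of Object.entries(qcModules)) {
    if (attribute in record && !module.isValid(record[attribute])) {
      record[attribute] = module.fix(record[attribute]);
    }
  }
  return record;
}

// Example module for business addresses (the logic here is purely illustrative).
registerModule('address', {
  isValid: function(value) { return typeof value === 'string' && value.trim().length > 0; },
  fix: function(value) { return String(value || '').trim().replace(/\s+/g, ' '); }
});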

Now We Make It Real

With our QC platform in place, we’ve begun rolling out fixes to various “hot spots” in Datafiniti. Initial projects include:

  • Removing incomplete or corrupted business reviews
  • Fixing inaccurate business names
  • Removing invalid UPC codes

If there are quality improvements you’d like to see, please let us know! Your feedback is invaluable as always.


A Few Charts to Show Off Our Web Crawling

We’ve begun scaling out the new, Voltron-powered 80legs.  I wanted to take a few minutes to show off a few charts that illustrate our ability to scale web crawling, which ultimately means more data coming into Datafiniti.

1. Here’s how many computers we’ve used for our web crawling:

[Chart: total number of crawling nodes used]

2. Here’s how many computers we’re using at any given time to run web crawls:

[Chart: number of active crawling nodes at any given time]

3. Here’s how many URLs we’re crawling each second:

[Chart: URLs crawled per second]

So things are rolling along pretty well on the crawling front.  If we extrapolate the data on the last chart, our current peak monthly web crawling capacity is over 300 million URLs.  And we’re just getting started.
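
To put the extrapolation in concrete terms: a crawl sustained at, say, 120 URLs per second works out to roughly 120 × 86,400 seconds × 30 days ≈ 311 million URLs per month; the actual peak rate is what the chart above shows.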

This has already had an impact on how much data is coming into Datafiniti.  I’ll be sharing some pretty charts on imports in the near future.


Early Schedule for Increasing Data Volume

It’s a bit late, but here’s an early schedule for increasing the amount of data coming into Datafiniti on a daily and monthly basis.  As you may know, we have several types of crawls we run to bring data into our search engine.  Daily crawls help keep data fresh, while comprehensive crawls make sure everything is included in our index.  Now that we’ve brought our new crawling system online, we’re working on scaling up the number of daily and comprehensive crawls so you can enjoy better data from us.

Anyway, here’s an early draft of the schedule!

DATE                 # OF DAILY CRAWLS    # OF COMPREHENSIVE CRAWLS
May 1, 2014          10                   5
June 1, 2014         25                   10
July 1, 2014         50                   25
August 1, 2014       75                   50
September 1, 2014    100                  75

The exact websites and order in which we include them in our index will depend on a variety of factors, including customer demand, how easy it is to crawl the site, and so on.  The exact rollout schedule is flexible, but we’ll post updates on how we’re doing each month.


Are Republicans Better Tippers?

I stumbled across this little post over at Quartz (via Consumerist) about “Which US States Tip the Most”. After a quick glance at the data, something jumped out at me: many southern and conservative states seem to be bigger tippers. So I thought it would be fun to map the data in the post against the concentration of Republican voters in each state. Using data from Wikipedia, here’s what I came up with:

[Chart: average tip % plotted against Republican vote share by state]

And here’s the data:

State                          Average Tip %          Republican Vote %
Utah                              16.1                    72.8
Wyoming                              15.4                    68.6
Oklahoma                              16.2                    66.8
Idaho                              16.5                    64.5
West Virginia                              16.7                    62.3
Arkansas                              16.9                    60.6
Alabama                              16.4                    60.6
Kentucky                              16.4                    60.5
Nebraska                              15.5                    59.8
Kansas                              16.2                    59.7
Tennessee                              16.3                    59.5
North Dakota                              15.6                    58.3
South Dakota                              15.3                    57.9
Louisiana                              16.1                    57.8
Texas                              16.3                    57.2
Montana                              16.0                    55.4
Mississippi                              16.5                    55.3
Alaska                              17.0                    54.8
South Carolina                              16.7                    54.6
Indiana                              16.4                    54.1
Missouri                              16.5                    53.8
Arizona                              16.5                    53.7
Georgia                              16.2                    53.3
North Carolina                              16.7                    50.4
Florida                              16.2                    49.1
Ohio                              16.1                    47.7
Virginia                              16.0                    47.3
Pennsylvania                              16.0                    46.6
New Hampshire                              16.2                    46.4
Iowa                              16.1                    46.2
Colorado                              16.5                    46.1
Wisconsin                              15.9                    45.9
Nevada                              16.2                    45.7
Minnesota                              15.7                    45.0
Michigan                              16.4                    44.7
New Mexico                              16.6                    42.8
Oregon                              15.7                    42.2
Washington                              15.9                    41.3
Maine                              16.4                    41.0
Connecticut                              15.6                    40.7
Illinois                              16.5                    40.7
New Jersey                              16.1                    40.6
Delaware                              14.0                    40.0
Massachusetts                              15.7                    37.5
California                              15.5                    37.1
Maryland                              15.8                    35.9
Rhode Island                              15.8                    35.2
New York                              15.8                    35.2
Vermont                              15.5                    31.0
Hawaii                              15.1                    27.8

Of course, my guess is that there are a ton of confounding variables at play here, but there does seem to be a trend.  At the very least, the data most likely runs counter to many people’s stereotypes!


4 Reasons You Should Use JSON Instead of CSV

Do you deal with large volumes of data?  Does your data contain hierarchical information (e.g., multiple reviews for a single product)?  Then you need to be using JSON as your go-to data format instead of CSV.

We offer CSV views when downloading data from Datafiniti for the sake of convenience, but we always encourage users to use the JSON views.  Check out these reasons to see how your data pipeline can benefit from making the switch.

1. JSON is better at showing hierarchical / relational data

Consider a single business record in Datafiniti.  Here’s a breakdown of the fields you might see:

  • Business name
  • Business address
  • A list of categories
  • A list of reviews (each with a date, user, rating, title, text, and source)

Now consider a list of these product records.  Each product will have a different number of prices and reviews.

Here’s what some sample data looks like in CSV (Datafiniti link):

And here’s that same data in JSON (Datafiniti link):
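
As a rough illustration (field names are simplified here, not Datafiniti's exact schema), a single record with its nested reviews might look like this in JSON:

{
  "name": "Buckingham Floor Company",
  "address": "415 East Butler Avenue",
  "categories": ["Flooring Contractor", "Hardwood Flooring"],
  "reviews": [
    { "date": "2014-05-02", "user": "example_user_1", "rating": 5, "title": "Great work", "text": "...", "source": "..." },
    { "date": "2014-06-18", "user": "example_user_2", "rating": 4, "title": "Solid job", "text": "...", "source": "..." }
  ]
}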

The JSON view looks so much better, right?

2. CSV will lose data

If you look closely at the CSV data above, you’ll notice that we have a set number of prices and reviews for each product.  This is because we’re forced to make some cut-off for how many prices and reviews we show.  If we didn’t, each row would have a different number of columns, which would make parsing the data next to impossible.  Unfortunately, many products have dozens or even hundreds of prices and reviews.  This means you end up losing a lot of valuable data by using the CSV view.

3. The standard CSV reader application (Excel) is terrible

Excel is great for loading small, highly-structured spreadsheet files.  It’s terrible at loading files that may have 10,000+ rows and 100+ columns, with some of those columns populated by unstructured text like reviews or descriptions.  It turns out that Excel does not follow CSV formatting standards, so even though we properly encode all the characters, Excel doesn’t know how to read them.  This results in some fields spilling over into adjacent columns, which makes the data unreadable.

4. JSON is easier to work with at scale

Without question, JSON is the de facto choice when working with data at scale.  Most modern APIs are RESTful, and therefore natively support JSON input and output.  Several database technologies (including most NoSQL variations) support it.  It’s also significantly easier to work with in most programming languages.  Just take a look at this simple PHP code for working with some JSON from Datafiniti:

Further Reading

Check out these helpful links to get more familiar with JSON:
