Classifying Websites with Neural Networks

Classifying Web Pages is Tricky

At Datafiniti, we have a strong need for converting unstructured web content into structured data.  For example, we’d like to find a page like:

Pants City!

and do the following:

  1. Determine that this web page is selling some sort of product
  2. Identify the correct name, price, and other attributes of the product

Both of these are hard things for a computer to do in an automated manner. While it’s easy for you or me to realize that the above web page is selling some jeans, a computer would have a hard time distinguishing the above page from either of the following web pages:
Pants City - Jeans

Or

Wikipedia: Jeans

Both of these pages share many similarities with the actual product page, but also have many key differences. The real challenge, though, is that if we look at the entire set of possible web pages, those similarities and differences become somewhat blurred, which means hard-and-fast classification rules will often fail. In fact, we can’t even rely on just looking at the underlying HTML, since there are huge variations in how product pages are laid out in HTML.

Our Solution: Neural Networks

While we could try and develop a complicated set of rules to account for all the conditions that perfectly identify a product page, doing so would be extremely time consuming, and frankly, incredibly boring work. Instead, we can try using a classical technique out of the artificial intelligence handbook: neural networks.

Here’s a quick primer on neural networks. Let’s say we want to know whether any particular mushroom is poisonous or not.  We’re not entirely sure what determines this, but we do have a record of mushrooms with their diameters and heights, along with which of these mushrooms were definitely poisonous to eat.  In order to see if we could use diameter and height to determine poisonous-ness, we could set up the following equation:

A * (diameter) + B * (height) = 0 or 1 for not-poisonous / poisonous

We would then try various combinations of A and B for all possible diameters and heights until we found a combination that correctly determined poisonous-ness for as many mushrooms as possible.
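As a rough sketch of this trial-and-error search, the Python below tries a grid of A and B values against a tiny mushroom record. The records, the 0.5 cutoff for calling a mushroom poisonous, and the grid range are all made up for illustration; they are not from our data.

```python
# Hypothetical records: (diameter_cm, height_cm, poisonous), invented for illustration.
MUSHROOMS = [
    (2.0, 9.0, 1),
    (3.0, 8.0, 1),
    (2.5, 10.0, 1),
    (8.0, 4.0, 0),
    (9.0, 5.0, 0),
    (7.0, 3.0, 0),
]


def predict(a, b, diameter, height):
    """Return 1 (poisonous) if A*diameter + B*height reaches the assumed 0.5 cutoff."""
    return 1 if a * diameter + b * height >= 0.5 else 0


def count_correct(a, b, records):
    """Count how many records the pair (A, B) classifies correctly."""
    return sum(predict(a, b, d, h) == label for d, h, label in records)


def grid_search(records, steps=21, low=-1.0, high=1.0):
    """Try every (A, B) pair on an evenly spaced grid and keep the best one."""
    best = (None, None, -1)
    for i in range(steps):
        a = low + (high - low) * i / (steps - 1)
        for j in range(steps):
            b = low + (high - low) * j / (steps - 1)
            score = count_correct(a, b, records)
            if score > best[2]:
                best = (a, b, score)
    return best


best_a, best_b, best_score = grid_search(MUSHROOMS)
print(best_a, best_b, best_score)
```

Even with two coefficients this means hundreds of evaluations, which is why blind search does not scale to dozens of inputs.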

Neural networks provide a structure for using the output of one set of input data to adjust A and B to the most likely best values for the next set of input data. By constantly adjusting A and B this way, we can quickly get to the best possible values for them.
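One simple way to picture this adjustment is the classic perceptron update rule, sketched below. This is a stand-in for illustration, not the training procedure we actually used: the records, the learning rate, and the extra bias term are assumptions.

```python
# Hypothetical records: (diameter_cm, height_cm, poisonous), invented for illustration.
MUSHROOMS = [
    (2.0, 9.0, 1),
    (3.0, 8.0, 1),
    (2.5, 10.0, 1),
    (8.0, 4.0, 0),
    (9.0, 5.0, 0),
    (7.0, 3.0, 0),
]


def predict(weights, diameter, height):
    """Return 1 (poisonous) if A*diameter + B*height + bias is positive."""
    a, b, bias = weights
    return 1 if a * diameter + b * height + bias > 0 else 0


def train(records, epochs=100, rate=0.1):
    """Nudge A, B, and the bias after every wrong prediction (perceptron rule)."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        mistakes = 0
        for diameter, height, label in records:
            error = label - predict(weights, diameter, height)
            if error:
                weights[0] += rate * error * diameter
                weights[1] += rate * error * height
                weights[2] += rate * error
                mistakes += 1
        if mistakes == 0:  # every record classified correctly, so stop early
            break
    return weights


weights = train(MUSHROOMS)
print(weights)
```

Each wrong answer pushes A and B in the direction that would have made that answer right, so the values settle after a handful of passes instead of an exhaustive search.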

In order to introduce more complex relationships in our data, we can introduce “hidden” layers in this model, which would end up looking something like:

For a more detailed explanation of neural networks, you can check out the following links:

Our Implementation

In our product page classifier algorithm, we set up a neural network with 1 input layer with 27 nodes, 1 hidden layer with 25 nodes, and 1 output layer with 3 output nodes. Our input layer modeled several features, including:

  • Price found on page
  • Image URL found on page
  • # of clickable images adjacent to price values
  • Keywords found in prominent positions (e.g., product detail, description, etc.)
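To make the idea of input features concrete, here is a minimal sketch that turns raw HTML into three numeric values. The regular expressions, the keyword list, and the extract_features helper are hypothetical; our real input layer has 27 features, and this sketch ignores where on the page a keyword appears.

```python
import re

# Assumed patterns, for illustration only.
PRICE_RE = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d{2})?")
IMG_RE = re.compile(r'''<img\b[^>]*\bsrc\s*=\s*["']([^"']+)["']''', re.IGNORECASE)
KEYWORDS = ("product detail", "description")


def extract_features(html):
    """Return [price found, image URL found, keyword count] as floats."""
    lowered = html.lower()
    return [
        1.0 if PRICE_RE.search(html) else 0.0,
        1.0 if IMG_RE.search(html) else 0.0,
        float(sum(lowered.count(keyword) for keyword in KEYWORDS)),
    ]


page = '<h1>Slim Jeans</h1><img src="/jeans.jpg"><p>Product Detail</p><span>$49.99</span>'
print(extract_features(page))
```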

Our output layer had the following:

  • Probability of being a product page
  • Probability of being a product category page (e.g., the second example page above)
  • Probability of being some other page

Our algorithm for the neural network took the following steps:
Diagram 1

The ultimate output is two weight matrices (T1 and T2) that we can use in a matrix equation to predict the page type for any given web page. This works like so:

Diagram 2
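In code, the prediction step might look roughly like the sketch below. It treats T1 and T2 as weight matrices for a 27-25-3 network, which is our reading of the diagram rather than something the post spells out. The bias column, the sigmoid activation, and the random weights are assumptions; a trained network would supply the real T1 and T2.

```python
import math
import random

LABELS = ["product page", "product category page", "other page"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, inputs):
    """Apply one layer: prepend a bias input of 1, then take sigmoid(row . x) per row."""
    x = [1.0] + list(inputs)
    return [sigmoid(sum(w * v for w, v in zip(row, x))) for row in weights]


def classify(t1, t2, features):
    """Run the features through both layers and pick the highest-scoring label."""
    hidden = layer(t1, features)  # 27 inputs -> 25 hidden activations
    output = layer(t2, hidden)    # 25 hidden -> 3 output scores
    best = max(range(len(output)), key=output.__getitem__)
    return LABELS[best], output


# Placeholder weights and features; real values would come from training and from a page.
rng = random.Random(0)
T1 = [[rng.uniform(-0.5, 0.5) for _ in range(28)] for _ in range(25)]  # 25 x (27 + bias)
T2 = [[rng.uniform(-0.5, 0.5) for _ in range(26)] for _ in range(3)]   # 3 x (25 + bias)
features = [rng.random() for _ in range(27)]

label, scores = classify(T1, T2, features)
print(label, scores)
```

With sigmoid outputs the three scores do not necessarily sum to 1, so they are better read as independent confidences than as a strict probability distribution.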

The Results

So how did we do? In order to determine how successful we were in our predictions, we need to determine how to measure success. In general, we want to measure how many true positive (TP) results we get compared to false positives (FP) and false negatives (FN). Conventional measurements for these are:

  • Precision (P) = TP / (TP + FP)
  • Recall (R) = TP / (TP + FN)
  • F-Score = 2 * P * R / (P + R)

Our implementation had the following results:

  • P = 0.929
  • R = 0.904
  • F-Score = 0.916
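The three formulas above translate directly into code. The sketch below uses made-up TP, FP, and FN counts for the first example (we are not publishing our raw counts here), then plugs in the reported P and R to confirm they are consistent with the reported F-Score.

```python
def precision(tp, fp):
    """Share of pages we labeled as positive that really were positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of truly positive pages that we managed to label as positive."""
    return tp / (tp + fn)


def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts, for illustration only.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
print(p, r, f_score(p, r))

# Consistency check on the reported results.
print(round(f_score(0.929, 0.904), 3))
```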

These scores are just over our training set, of course. The actual scores on real-life data may be a bit lower, but not by much. This is pretty good!  We should have an algorithm on our hands that can accurately classify product pages about 90% of the time.

Extracting Product Data

Of course, identifying product pages isn’t enough.  We also want to pull out the actual structured data!  In particular, we’re interested in product name, price, and any unique identifiers (e.g., UPC, EAN, & ISBN).  This information would help us fill out our product search.

We don’t actually use neural networks for doing this.  Neural networks are better suited to classification problems, and extracting data from a web page is a different type of problem.  Instead, we use a variety of heuristics specific to each attribute we’re trying to extract.  For example, for product name, we look at the <h1> and <h2> tags, and use a few metrics to determine the best choice.  We’ve been able to achieve around 80% accuracy here.  We may go into the actual metrics and methodology for developing them in a separate post!
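As an illustration of what such a heuristic could look like, the sketch below collects the <h1> and <h2> text on a page and scores each candidate. The scoring rule (word overlap with the page <title>, plus a small bonus for <h1>) is purely hypothetical and is not the set of metrics we use; the sample page is invented as well.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the <title> text and the text of every <h1> and <h2>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []  # list of (tag, text) pairs
        self._current = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.headings.append((tag, text))
            self._current = None


def score(tag, text, title):
    """Hypothetical metric: fraction of heading words found in the title, plus an h1 bonus."""
    title_words = set(title.lower().split())
    words = set(text.lower().split())
    overlap = len(words & title_words) / len(words) if words else 0.0
    return overlap + (0.5 if tag == "h1" else 0.0)


def guess_product_name(html):
    """Return the best-scoring heading text, or None if the page has no headings."""
    parser = HeadingCollector()
    parser.feed(html)
    if not parser.headings:
        return None
    best = max(parser.headings, key=lambda h: score(h[0], h[1], parser.title))
    return best[1]


SAMPLE = (
    "<html><head><title>Slim Fit Jeans - Pants City</title></head><body>"
    "<h1>Welcome to our store</h1><h2>Slim Fit Jeans</h2><h2>Customer Reviews</h2>"
    "</body></html>"
)
print(guess_product_name(SAMPLE))
```

A single rule like this is easy to fool, which is one reason to combine several metrics per attribute.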

More to Come

We feel pretty good about our ability to classify and extract product data. The extraction part could be better, but it’s steadily being improved. In the meantime, we’re also working on classifying other types of pages, such as business data, company team pages, event data, and more.
As we roll out these classifiers and data extractors, we’re including each one in our crawl of the entire Internet. This means that we can scan the entire Internet and pull out any available data that exists out there. Exciting stuff!

We’re Moving to Austin!

It’s a secret no more – on June 1st, we’ll be closing down our Houston office and moving to Austin!  We’re extremely excited about the move and becoming an active member of the Austin startup and technology communities.

Making a Splash in Austin

We actually already have an office in Austin, where ½ of our team has been working for the past 6 months.  It’s a downtown office in a mixed-use building, which gives it a cozy feel – perfect for us.  Beyond the move itself, we’re looking forward to leveraging the passion and expertise of the Austin community to help us generate new ideas on solving the problems we’re working on.  We’ve been asked to present at the resurgent AHUG and two of our developers want to restart the Erlang meetup group.  You can also count on seeing us at local startup events and maybe even next year’s SXSW Startup Crawl.

We’re also planning on putting together a “Welcome to Austin” party sometime in July, so be on the lookout for that!  Our building has a pool, so it’ll be perfect for the weather :)

Why We Moved

I really wanted to tackle the reasons behind our move in this post.  A lot of folks, in Houston and Austin, have been very curious about it.  Our reason is simple:  we didn’t find the people we wanted to hire in Houston and we did find them in Austin.

Last year we made the decision to buckle down on building out Datafiniti, which meant hiring a full-time product team consisting of a UI developer and UX designer.  We had a very good understanding that we couldn’t just hire any UI/UX folks.  They had to have the talent and experience to take a very complex concept (a search engine for data) and make it approachable and usable (we want anyone to be able to use Datafiniti).  We initially tried to fill these roles in Houston and used several recruiting techniques:

  1. Leveraged our personal networks
  2. Posted jobs on mainstream and targeted web sites (e.g., Dribbble)
  3. Posted messages in local meetup groups
  4. Posted to Hacker News and Facebook groups

For starters, we didn’t get a lot of interest in either position.  The candidates who did apply generally fell into these two categories:

  1. UI developers with limited programming experience centered around lightweight consumer apps
  2. Graphic designers (not UX designers) mostly with experience in print media

We spent 6 months on our Houston recruiting effort with disappointing results.  After 3 months of growing frustration and delaying our product timeline, we started recruiting in Austin as well.  We emailed our personal contacts in Austin for references, posted job listings, and messaged local Facebook groups.  In 3 months, we found two excellent people for our product team.  In comparison to the Houston candidates we saw, the guys we hired in Austin had the following credentials:

  1. Designed interfaces for one of our primary competitors
  2. Left an analyst role to form his own startup
  3. Experience with all cutting-edge front-end technologies and coding practices

In half the time, our Austin search yielded better, more relevant candidates than our Houston search.  Truth be told, we had experienced similar frustrations hiring in Houston before this one.  After this, we began to seriously consider a permanent move to Austin.  As we look to fill in other jobs at Datafiniti, we’re even finding people willing to relocate from California to Austin.  We never saw this when we were in Houston.

Responses to Our Move

After news broke about our move, there was a lot of discussion in the Houston community.  We stayed out of the conversation on the Facebook groups, but thought we owed the community a detailed response.

Here are a few themes worth addressing:

Startups should grow the local talent base:

“Startups should be willing to grow their own talent base if they can’t find the perfect peg for their hole.”

“We have to be careful using ‘can’t find talent’ as an excuse, when there are possibly other issues that need to be addressed.  Part of the DNA of a startup is being flexible, agile and innovative.”

It’s certainly true that startups should grow their own talent base as well as the local scene.  In fact, we did this ourselves by starting the Big Data Houston group, which is now the largest such group in the city.  We also actively participated in local hackathons and startup events.  I see three issues with the assertions above:

  1. One of our hiring philosophies is to only hire people that are adding or creating new value for our team, rather than simply “filling in value”, as I like to put it.  This means each person we hire is bringing something new to the team and making us better, instead of simply keeping things at status quo.  This approach makes it hard to hire inexperienced people but it pays off in terms of organizational quality and growth.
  2. Our active efforts to build the local talent base only went so far.  We definitely succeeded in raising the profile of “big data” in Houston, but our meetups only had an average attendance of 20 – 25 people.  This seems small compared to the size of the city.  This strategy is also a long bet that takes time to pay off.  Too long for us.
  3. Startups should take the easier path whenever possible.  Why would we invest time into training people if we can hire people with the right experience elsewhere?  While our move to Austin has some cost, it doesn’t figure to be very painful and the benefit of a larger talent pool (among others) should outweigh the short-term costs significantly.

Hiring fresh grads is the key:

“What efforts are being made to actively pair up development talent from all of the universities and community colleges in Houston with Startups and/or Entrepreneurs?  That is the perfect intersection to find people who are learning to develop and get them involved.”

“If you can sell to an 18 yr old how much cooler [working at a startup] is, how a bigger impact could be made, that’s half the battle. We have 3 college interns working with us, UNPAID.”

In 99% of cases, trying to hire a fresh graduate would be a terrible strategy for us.  Here’s why:

  1. The work we do cannot and should not be done by unpaid interns.  There does seem to be a tendency to over-generalize strategies in these startup discussions.  Unpaid interns may work for some.  It would absolutely not work for us.  It also conflicts with the hiring philosophy I mentioned earlier.
  2. The 1% of fresh graduates that would be a good fit for us already have job offers by their junior year from the likes of Google & Facebook.  Just ask the students at Rice.  Google can afford to take these chances, but we can’t.

Houston has a large talent base that we failed to tap into:

“I’ve personally worked with people at JPMorgan, Macquarie, Conoco Phillips, and PROS who are all doing amazingly interesting stuff with both SQL and NoSQL databases. I’ve also worked with a handful of really good UX consultants here in town. It’s a shame they weren’t able to find what they were looking for here.”

This is a big point that’s brought up every time our move is discussed.  People assume that because Houston is the 4th largest city in the US, it must have a huge developer talent pool.  Unfortunately, this assumption is most likely wrong, and I have data to back that up (I am the CEO of a data search company, after all).

Here are some counts for people on LinkedIn matching keywords that would interest us:

KEYWORD             # IN HOUSTON  # IN AUSTIN  % DIFFERENCE
Software Developer        10,301       12,485         21.2%
Software Engineer         21,148       24,209         14.5%
Programmer                 8,658        7,747        -10.5%
Java Developer             3,111        5,089         63.6%
UI Developer                 631        1,488        135.8%
UX Designer                  204          763        274.0%
Hadoop                       145          499        244.1%
NoSQL                         55          204        270.9%
Erlang                        33           44         33.3%

(One day you’ll be able to pull a more accurate data set than this from Datafiniti’s People Search.)
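For anyone who wants to check the arithmetic, the % difference column appears to be the Austin count measured relative to the Houston count. That reading is an assumption on my part, sketched below and checked against three of the rows.

```python
def percent_difference(houston, austin):
    """Percent by which the Austin count differs from the Houston count."""
    return (austin - houston) / houston * 100.0


# Counts copied from three rows of the table above.
rows = {
    "Software Developer": (10301, 12485),
    "Programmer": (8658, 7747),
    "Erlang": (33, 44),
}

for keyword, (houston, austin) in rows.items():
    print(f"{keyword}: {percent_difference(houston, austin):.1f}%")
```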

The data above supports our hypothesis that while Austin’s overall talent pool may be smaller, it has a larger talent pool for the skills that interest us.  This disparity grows as we get more specific in the search.

I also looked at the companies that LinkedIn highlights as the top 11 companies for “software engineer”.  Here’s how Houston and Austin stack up:

TOP COMPANIES IN HOUSTON      TOP COMPANIES IN AUSTIN
Hewlett-Packard (286)         Dell (635)
Halliburton (188)             IBM (362)
BMC Software (124)            Hewlett-Packard (204)
Schlumberger (103)            University of Texas (192)
BP (100)                      National Instruments (154)
JPMorgan Chase (97)           AMD (123)
University of Houston (88)    Oracle (104)
ExxonMobil (77)               General Motors (102)
Baker Hughes (73)             HomeAway.com (97)
PROS (71)                     Paypal (88)
Avanade (62)                  BMC Software (88)

The tech talent in Austin tends to work more for tech companies, rather than for other sorts of companies.  This is totally expected, given the difference in each city’s economy, but is a very important point.  Tech talent working as support for other business units is very different from tech talent working as the primary value creators in a company.  The best talent in any industry will flock to where those people are the primary value creators for companies, rather than auxiliary departments.

There are a lot of talented people here, but they don’t have the right mindset:

“The culture is not here, but it’s ridiculous to claim that the talent is not here.”

I may be quibbling over semantics here, but it’s an important point.  I consider a person’s mentality (culture) part of his/her “talent”.  We’ve found over and over again that people that are willing to risk being part of a startup are more likely to be self-motivated when it comes to learning new skills, techniques, and technologies.  The same mindset that gets people interested in startups seems to correlate highly with a richer skill set, especially when it comes to software development.  Almost all of our hires have been people that were fed up with corporate culture and wanted to make a bigger impact at a smaller company.  These same people all took it upon themselves to work on side projects and develop new skills while they found that opportunity with us.

Houston Is on the Right Track

I don’t want this post to be a complete slam of Houston.  The fact is we did find some awesome folks here, so clearly they exist.  The startup and technology community is also growing and moving in the right direction.  As I write this, there’s an amazing hackathon going on.  Last week, we had our biggest turnout ever at Big Data Houston.  There are awesome co-working spaces popping up.  The startup/tech calendar is packed like it’s never been before.  It’s a very exciting time for the Houston scene.

The Road Ahead

Our move to Austin is not without sacrifice, but we’re incredibly excited to write our company’s next chapter in this great city.  We’re really looking forward to having the same sort of community involvement in Austin that we had in Houston.  You’ll see us at many of the local meetups, you’ll see us participating in local technology discussions, and hopefully you’ll see us in the press as our company grows.  But most importantly, you’ll find us in our new digs at 904 West Avenue, building the world’s first search engine for data and enjoying life!
