Early Schedule for Increasing Data Volume

It’s a bit late, but here’s an early schedule for increasing the amount of data coming into Datafiniti on a daily and monthly basis.  As you may know, we have several types of crawls we run to bring data into our search engine.  Daily crawls help keep data fresh, while comprehensive crawls make sure everything is included in our index.  Now that we’ve brought our new crawling system online, we’re working on scaling up the number of daily and comprehensive crawls so you can enjoy better data from us.

Anyway, here’s an early draft of the schedule!

DATE                # OF DAILY CRAWLS   # OF COMPREHENSIVE CRAWLS
May 1, 2014         10                  5
June 1, 2014        25                  10
July 1, 2014        50                  25
August 1, 2014      75                  50
September 1, 2014   100                 75

The exact websites and order in which we include them in our index will depend on a variety of factors, including customer demand, how easy it is to crawl the site, and so on.  The exact rollout schedule is flexible, but we’ll post updates on how we’re doing each month.

Are Republicans Better Tippers?

I stumbled across this little post over at Quartz (via Consumerist) about “Which US States Tip the Most”. After a quick glance over the data, something jumped out at me: many southern and conservative states seem to be bigger tippers. So I thought it would be fun to map the data in the post against the concentration of Republican voters in each state. Using data from Wikipedia, here’s what I came up with:

[Chart: republican tippers]

And here’s the data:

State  Average Tip %  Republican Voters %
Utah                              16.1                    72.8
Wyoming                              15.4                    68.6
Oklahoma                              16.2                    66.8
Idaho                              16.5                    64.5
West Virginia                              16.7                    62.3
Arkansas                              16.9                    60.6
Alabama                              16.4                    60.6
Kentucky                              16.4                    60.5
Nebraska                              15.5                    59.8
Kansas                              16.2                    59.7
Tennessee                              16.3                    59.5
North Dakota                              15.6                    58.3
South Dakota                              15.3                    57.9
Louisiana                              16.1                    57.8
Texas                              16.3                    57.2
Montana                              16.0                    55.4
Mississippi                              16.5                    55.3
Alaska                              17.0                    54.8
South Carolina                              16.7                    54.6
Indiana                              16.4                    54.1
Missouri                              16.5                    53.8
Arizona                              16.5                    53.7
Georgia                              16.2                    53.3
North Carolina                              16.7                    50.4
Florida                              16.2                    49.1
Ohio                              16.1                    47.7
Virginia                              16.0                    47.3
Pennsylvania                              16.0                    46.6
New Hampshire                              16.2                    46.4
Iowa                              16.1                    46.2
Colorado                              16.5                    46.1
Wisconsin                              15.9                    45.9
Nevada                              16.2                    45.7
Minnesota                              15.7                    45.0
Michigan                              16.4                    44.7
New Mexico                              16.6                    42.8
Oregon                              15.7                    42.2
Washington                              15.9                    41.3
Maine                              16.4                    41.0
Connecticut                              15.6                    40.7
Illinois                              16.5                    40.7
New Jersey                              16.1                    40.6
Delaware                              14.0                    40.0
Massachusetts                              15.7                    37.5
California                              15.5                    37.1
Maryland                              15.8                    35.9
Rhode Island                              15.8                    35.2
New York                              15.8                    35.2
Vermont                              15.5                    31.0
Hawaii                              15.1                    27.8

Of course, my guess is that there are a ton of confounding variables at play, but there does appear to be some trend here.  At the very least, the data most likely runs counter to many people’s stereotypes!
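One way to put a number on that trend is a Pearson correlation between the two columns. Here’s a minimal Python sketch using the rows from the table above; the `pearson` helper and the variable names are my own, and a single coefficient says nothing about the confounding variables mentioned earlier.

```python
import math

# (state, average tip %, Republican %) copied from the table above.
DATA = [
    ("Utah", 16.1, 72.8),
    ("Wyoming", 15.4, 68.6),
    ("Oklahoma", 16.2, 66.8),
    ("Idaho", 16.5, 64.5),
    ("West Virginia", 16.7, 62.3),
    ("Arkansas", 16.9, 60.6),
    ("Alabama", 16.4, 60.6),
    ("Kentucky", 16.4, 60.5),
    ("Nebraska", 15.5, 59.8),
    ("Kansas", 16.2, 59.7),
    ("Tennessee", 16.3, 59.5),
    ("North Dakota", 15.6, 58.3),
    ("South Dakota", 15.3, 57.9),
    ("Louisiana", 16.1, 57.8),
    ("Texas", 16.3, 57.2),
    ("Montana", 16.0, 55.4),
    ("Mississippi", 16.5, 55.3),
    ("Alaska", 17.0, 54.8),
    ("South Carolina", 16.7, 54.6),
    ("Indiana", 16.4, 54.1),
    ("Missouri", 16.5, 53.8),
    ("Arizona", 16.5, 53.7),
    ("Georgia", 16.2, 53.3),
    ("North Carolina", 16.7, 50.4),
    ("Florida", 16.2, 49.1),
    ("Ohio", 16.1, 47.7),
    ("Virginia", 16.0, 47.3),
    ("Pennsylvania", 16.0, 46.6),
    ("New Hampshire", 16.2, 46.4),
    ("Iowa", 16.1, 46.2),
    ("Colorado", 16.5, 46.1),
    ("Wisconsin", 15.9, 45.9),
    ("Nevada", 16.2, 45.7),
    ("Minnesota", 15.7, 45.0),
    ("Michigan", 16.4, 44.7),
    ("New Mexico", 16.6, 42.8),
    ("Oregon", 15.7, 42.2),
    ("Washington", 15.9, 41.3),
    ("Maine", 16.4, 41.0),
    ("Connecticut", 15.6, 40.7),
    ("Illinois", 16.5, 40.7),
    ("New Jersey", 16.1, 40.6),
    ("Delaware", 14.0, 40.0),
    ("Massachusetts", 15.7, 37.5),
    ("California", 15.5, 37.1),
    ("Maryland", 15.8, 35.9),
    ("Rhode Island", 15.8, 35.2),
    ("New York", 15.8, 35.2),
    ("Vermont", 15.5, 31.0),
    ("Hawaii", 15.1, 27.8),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


tips = [row[1] for row in DATA]
republican = [row[2] for row in DATA]
r = pearson(republican, tips)
print(f"Pearson r across {len(DATA)} states: {r:.2f}")
```

A value near 0 would mean no linear relationship; values closer to 1 or -1 mean a stronger one.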

4 Reasons You Should Use JSON Instead of CSV

Do you deal with large volumes of data?  Does your data contain hierarchical information (e.g., multiple reviews for a single product)?  Then you need to be using JSON as your go-to data format instead of CSV.

We offer CSV views when downloading data from Datafiniti for the sake of convenience, but we always encourage users to use the JSON views.  Check out these reasons to see how your data pipeline can benefit from making the switch.

1. JSON is better at showing hierarchical / relational data

Consider a single business record in Datafiniti.  Here’s a breakdown of the fields you might see:

  • Business name
  • Business address
  • A list of categories
  • A list of reviews (each with a date, user, rating, title, text, and source)

Now consider a list of records like this one, for example product records.  Each product will have a different number of prices and reviews.

Here’s what some sample data would look like in CSV (Datafiniti link):

And here’s that same data in JSON (Datafiniti link):

The JSON view looks so much better, right?
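To make the comparison concrete, here’s a minimal Python sketch that builds one business record with the fields listed above and prints it as JSON. The values and field names are hypothetical, made up for illustration; they aren’t Datafiniti’s actual schema.

```python
import json

# Hypothetical business record; field names and values are illustrative only.
record = {
    "name": "Example Coffee House",
    "address": "123 Main St, Austin, TX",
    "categories": ["Cafe", "Bakery"],
    "reviews": [
        {
            "date": "2014-03-01",
            "user": "alice",
            "rating": 5,
            "title": "Great espresso",
            "text": "Friendly staff, fast service.",
            "source": "example.com",
        },
        {
            "date": "2014-03-04",
            "user": "bob",
            "rating": 3,
            "title": "Decent",
            "text": "Good coffee, but the line was long.",
            "source": "example.org",
        },
    ],
}

# Nested lists stay nested: each review keeps its own fields.
print(json.dumps(record, indent=2))
```

The record can hold any number of categories or reviews without changing its shape, which is the property the CSV view struggles with.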

2. CSV will lose data

If you look closely at the CSV data above, you’ll notice that we have a set number of prices and reviews for each product.  This is because we’re forced to make some cut-off for how many prices and reviews we show.  If we didn’t, each row would have a different number of columns, which would make parsing the data next to impossible.  Unfortunately, many products have dozens or even hundreds of prices and reviews.  This means you end up losing a lot of valuable data by using the CSV view.
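Here’s a rough Python sketch of why that cut-off loses data. The `flatten` helper, the `MAX_REVIEWS` limit of 2, and the sample products are all hypothetical; the real CSV view uses its own column layout and limits.

```python
import csv
import io

MAX_REVIEWS = 2  # hypothetical cut-off so every row has the same columns


def flatten(product, max_reviews=MAX_REVIEWS):
    """Flatten one nested product into a fixed-width row, truncating extra reviews."""
    row = {"name": product["name"]}
    reviews = product["reviews"]
    for i in range(max_reviews):
        # Pad with an empty string when a product has fewer reviews than columns.
        row[f"review_{i + 1}"] = reviews[i] if i < len(reviews) else ""
    return row


# Made-up sample data for illustration.
products = [
    {"name": "Widget", "reviews": ["Great", "OK", "Broke in a week", "Love it"]},
    {"name": "Gadget", "reviews": ["Fine"]},
]

rows = [flatten(p) for p in products]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())

# Count the reviews that did not fit into the fixed columns.
dropped = sum(max(0, len(p["reviews"]) - MAX_REVIEWS) for p in products)
print(f"Reviews dropped by the cut-off: {dropped}")
```

Raising the limit only moves the problem: every row then carries mostly empty columns, and products with more reviews than the limit still get truncated.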

3. The standard CSV reader application (Excel) is terrible

Excel is great for loading small, highly structured spreadsheet files.  It’s terrible at loading files that may have 10,000 rows and 100+ columns, with some of these columns populated by unstructured text like reviews or descriptions.  It turns out that Excel does not follow CSV-formatting standards, so even though we properly encode all the characters, Excel doesn’t read them correctly.  This results in some fields spilling over into adjacent columns, which makes the data unreadable.
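For comparison, here’s a small Python sketch showing that a properly quoted CSV field survives a round trip through a parser that follows the usual quoting rules. It uses Python’s standard `csv` module; the review text is made up, and this says nothing about how any particular Excel version behaves.

```python
import csv
import io

# Hypothetical review text with a comma, a line break, and double quotes,
# the characters that typically cause fields to spill into other columns.
review = 'Loved it, would buy again.\nMy friend said "best ever".'

# Write one row; the csv module quotes the field and doubles the inner quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(["Widget", 5, review])
encoded = buffer.getvalue()
print(encoded)

# Read it back: the review comes out as a single field, unchanged.
parsed = next(csv.reader(io.StringIO(encoded)))
print(len(parsed), "fields")
```

Note that the rating comes back as the string "5": CSV carries no type information, so every value is read as text.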

4. JSON is easier to work with at scale

Without question, JSON is the de facto choice when working with data at scale.  Most modern APIs are RESTful and typically support JSON input and output natively.  Several database technologies (including most NoSQL variations) support it.  It’s significantly easier to work with within most programming languages as well.  Just take a look at this simple PHP code for working with some JSON from Datafiniti:
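The same idea can be sketched in Python (an illustrative stand-in, not the PHP snippet mentioned above). The payload and its field names are hypothetical and don’t reflect the actual Datafiniti response format.

```python
import json

# Hypothetical JSON payload; field names are assumptions for illustration.
payload = """
[
  {"name": "Widget", "reviews": [{"rating": 5, "text": "Great"},
                                 {"rating": 3, "text": "OK"}]},
  {"name": "Gadget", "reviews": []}
]
"""


def average_rating(product):
    """Return the mean review rating, or None when there are no reviews."""
    ratings = [review["rating"] for review in product["reviews"]]
    return sum(ratings) / len(ratings) if ratings else None


# One call turns the text into nested lists and dicts.
products = json.loads(payload)

summary = {product["name"]: average_rating(product) for product in products}
for name, avg in summary.items():
    print(name, avg)
```

No column counting or padding is needed: each product simply carries however many reviews it has.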

Further Reading

Check out these helpful links to get more familiar with JSON:

Meet the Datafiniti Crew During SXSW

If you’re in Austin during SXSW, we’d love to meet you!  Here are a few ways you can meet the team behind the search engine for data:

SXSW Startup Crawl

We’ll be at the Omni Hotel during the annual SXSW Startup Crawl.  Come by our table on the first floor and pick up a Datafiniti t-shirt and sticker!  A few team members will be there to answer your questions.  You can register for the crawl here.

Come By Our Office

Our office is located at 904 West Ave, Ste. 109, Austin, TX 78701.  Let us know if you’d like to swing by!  We’re a short pedicab ride from the convention center and just far enough to feel like you’ve escaped the craziness.

Schedule a Time to Meet

If you’d like to set up a specific time to meet, please contact us.  We’ll be more than happy to find some time to meet and discuss your data needs.

New 80legs Rollout Schedule

As promised, here’s a detailed rollout schedule for the new, Voltron-powered 80legs:

[Embedded document: voltron update (Google Drive)]

Here are the key dates:

  • March 1: We will begin on-boarding 80legs customers onto the new system.  At this time, we’ll also begin on-boarding internal daily crawls for Datafiniti.  Initially, customers will only have access to the new 80legs API.  There will be no website for the new 80legs at this point.
  • April 15: All 80legs customers will be on-boarded to the new 80legs.  We hope to have a website for the new 80legs by this time, but we are still in the process of confirming this delivery date for the website.
  • May 1: The legacy 80legs will be retired and no longer available.

We will provide detailed instructions to affected customers ahead of these dates.

How we’re building the future of web crawling

Our web crawling platform, 80legs, was built over 4 years ago. A lot has changed since 2008. For starters, “big data” wasn’t even a term then, let alone a cliche. Today, there is a wide variety of technologies available for handling true big data. For this and other reasons, we’ve been secretly working on a massive overhaul to 80legs that promises to deliver the future of web crawling. We call it Voltron.

Built from the Ground Up

Voltron has been built from the ground up to take advantage of the latest technologies for storing, processing, and delivering massive amounts of data. Here are some quick highlights of the benefits you’ll see:

  • Auto-scaling infrastructure using cloud computing for reduced queue time and faster crawling
  • A RESTful API for more seamless integration
  • Moving from Java to JavaScript for easier 80app development
  • Faster result delivery using global CDNs

The Rollout Schedule

Much of the alpha development for Voltron has been completed. Internal testing will begin in mid-February. Crawls used for Datafiniti data collection and those run by 80legs customers will be on-boarded in March. We expect to wrap up the on-boarding and final testing in April, with a shutdown of the legacy system by May. As with any large software rollout, there may be unexpected hiccups, but we’ll keep everyone up to date on the latest developments as we progress.

How It Will Affect Datafiniti Users

Voltron will enable a significant increase in the amount of data made available to Datafiniti users. During the course of Voltron’s rollout, we will be scaling the number of “daily crawls” to select websites from 10 to 50 to 100 between February and May. Each daily crawl will collect data from 100,000 URLs on its selected website. By May, Datafiniti will have over 10,000,000 business or product records updated each day. This means our customers will enjoy having daily-updated review and pricing data from the websites they are most interested in monitoring.
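The arithmetic behind that figure can be sketched in a few lines of Python. The crawl counts and the 100,000-URL figure come from the paragraph above; the assumption that each crawled URL yields roughly one updated record is mine, so treat the result as a rough ceiling.

```python
URLS_PER_DAILY_CRAWL = 100_000  # per-website figure quoted above


def records_per_day(daily_crawls, urls_per_crawl=URLS_PER_DAILY_CRAWL):
    """Rough ceiling on daily updates, assuming one record per crawled URL."""
    return daily_crawls * urls_per_crawl


# The three rollout milestones mentioned above.
for crawls in (10, 50, 100):
    print(f"{crawls:>3} daily crawls -> {records_per_day(crawls):,} URLs per day")
```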

How It Will Affect 80legs Users

Many 80legs users have rightly felt frustration over crawl performance recently. This will change with Voltron. Intro, Plus, and Premium users will see queue times drop below 1 hour. Dedicated users will see 0 queue time. Crawl speed will improve as well, as several internal bottlenecks are being addressed by Voltron. In addition to performance improvements, we will be providing a new RESTful API and website that should make developing crawls much easier for everyone.

The Future of Web Crawling... Soon!

We’re very excited to start using Voltron ourselves to feed Datafiniti and to provide it to our 80legs customers for their own web crawling. Stay tuned for more updates as the future of web crawling takes shape!

Updated website & data fixes!

We’ve made some big updates to the Datafiniti website!  You should go check it out now if you haven’t already.  Changes include:

  1. Focus on business and product data.  We’ve decided to focus on daily updates to our business and product data.  We’ll circle back with more information about what we’re doing here, but here’s a quick peek: we’re going to be providing daily updates across hundreds of websites for business and product listings.  Our website has been updated to reflect this focus.
  2. Removed “People” data search.  People data is still technically available in our database, but since we are seeing the most interest in our business and product data, we’ve decided to focus on just those two data types for now.  People data will likely make a return sometime in the future.
  3. More content around use-cases.  Check out how Datafiniti can be used for monitoring business reviews, monitoring product reviews, and monitoring product prices.

We’re also chunking through some much-needed data fixes.  Expect to see several attributes around businesses fixed throughout the month.

How Moving to Austin Improved Our Hiring Process

On May 18, 2013, we announced that we were moving Datafiniti to Austin.  A big reason for our move was the lack of a suitable talent pool from which to hire.  As discussed on Xconomy, there was a lot of skepticism (perhaps surprisingly so) around our reasoning.  We provided a healthy dose of data to back up our belief that the move to Austin would help us out.  Now, after 8 months, I can provide a full breakdown of how our move to Austin benefited our hiring process.

By the numbers

[Figure: hiring conversion funnel]

Above you can see a conversion funnel for our hiring process.  Once a candidate shows interest in Datafiniti, we have 5 steps to recruit and screen them:

  1. Intro call: A 30-45 minute call to tell the candidate a bit about our team and what we do, as well as to learn a few basic things about the person’s experience and personality.
  2. Fizz buzz: A small quiz that tests basic programming concepts.
  3. Coding challenge: A 1-2 day programming challenge that lets us evaluate someone’s coding style and knack for algorithms and data structures.
  4. Interview: An in-person, half-day interview that serves as a deep dive into someone’s technical abilities and cultural fit.  We also give the candidate plenty of opportunity to learn as much as possible about Datafiniti.  Interviews are two-way streets.
  5. Hiring: If a candidate makes it this far, we’ve extended them an offer.  “Failure” here means they did not accept the offer.
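For anyone unfamiliar with step 2, the classic fizz buzz exercise looks something like the Python sketch below. This is the generic textbook version, shown only to give a feel for the level of difficulty; it isn’t the actual quiz we give candidates.

```python
def fizz_buzz(n):
    """Return the fizz buzz sequence for 1..n as a list of strings."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            # Multiples of both 3 and 5 must be checked first.
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out


print("\n".join(fizz_buzz(15)))
```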

Between May and December, we converted 39 interested candidates into 3 hires.  All of this was done without paying for any recruiting services, including job listings.

Differences between Houston & Austin

A larger, more competitive recruiting environment

This is intuitive.  There are more developers in Austin, but also more companies trying to hire the same people as us.  Because of this, we found more people that matched the listed skill sets for our job openings, but we saw greater drop-offs through the hiring process as compared to what we experienced in Houston.

Our recruiting niche

In Houston, we separated ourselves from other companies hiring developers by pitching ourselves as offering a unique (for Houston) tech startup environment.  That was obviously not unique in Austin, but we found that we still offered a unique, albeit different, environment here.  Simply put: we offer people the chance to work with a small team that works on big problems.  We’re still around 10 people, but we process billions of data points every day.  We also offer developers the opportunity to work with and be exposed to a wide variety of technologies.  We use multiple programming languages, databases, and algorithms.  Our developers get to touch any or all of these tools if they want.  All of that is awesome and exciting to candidates.

Streamlining our hiring process

We realized that our hiring process was taking longer than we wanted.  Obviously we wanted to do a sufficiently thorough job of screening people, but we felt that for certain candidates, every step wasn’t necessary.  It could even be a hindrance to keeping a candidate engaged.  When a candidate had a fairly active public code repo, we sometimes skipped the coding challenge.  When a candidate came from a very technical background, we sometimes skipped the fizz buzz.  We always made sure to test these same concepts during the in-person interview, so we never “degraded” the comprehensiveness of our screening.  We just made it faster when we could.

In summary

Hiring in Austin is harder than it is in Houston.  It’s also more rewarding.  We learned a lot about what made us unique.  We evolved as a team, and everyone learned how to be better recruiters. Most importantly, we’ve constructed a team that will help us serve our customers and grow our business better than we ever have before.

BTW, if you’re interested in finding out more about what makes working with us so awesome, check out our latest job postings and contact us if you feel like you’d fit in.

Datafiniti V2 to go live on Monday

We will be deploying V2 of our website and API Monday morning.  There will be some downtime between 8 am and 1 pm central time on Monday, but things should be good after that.  V1 of our API will no longer be available after this change.

The V2 API provides better functionality, reliability, and performance than the V1 API.  You can view initial documentation for it here: http://datafiniti.github.io/developer.datafiniti.net/.

Austin A-List & Upcoming Improvements

So this is exciting.  Within 6 months of moving our team to Austin, we’ve been named by the Austin Chamber of Commerce as one of their 2013 A-List companies.  We were included in the “Emerging” category and selected from a group of 157 companies.  You can read more here: http://impactnews.com/austin-metro/central-austin/%27a-list%27-companies-named/.

It’s always nice to get recognition, but we still have a lot of work to do to achieve our goal of making web data fully accessible.  Earlier this week we gave early access to our V2 API through a new Download App.  The new version of our website and API will be made publicly available next week.  We’re still considering ourselves to be in beta mode, though, so please give us any feedback you have.

In addition to these upcoming releases, we’re also working on a giant upgrade to our back-end architecture to dramatically improve the volume and rate of web content we’re ingesting.  We have a few metrics we’re targeting:

  • Crawling individual, content-rich websites at a rate of 100,000 URLs per day.  Each website-specific daily crawl will track high-priority businesses and products to provide daily-updated data on reviews and prices.
  • Crawling more than 1 billion URLs every month.  This will enhance our web-wide crawling for more content/data discovery on businesses, people, and products.
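To get a feel for what those targets mean as sustained throughput, here’s a quick back-of-the-envelope sketch in Python. The 30-day month and the assumption that crawling is spread evenly around the clock are mine; real crawl traffic is unlikely to be that uniform.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption for a round estimate


def urls_per_second(urls, days):
    """Average URLs per second if the crawl is spread evenly over the period."""
    return urls / (days * SECONDS_PER_DAY)


site_rate = urls_per_second(100_000, 1)                    # one website, one day
web_rate = urls_per_second(1_000_000_000, DAYS_PER_MONTH)  # web-wide, one month

print(f"Per-site daily crawl: {site_rate:.2f} URLs/sec")
print(f"Web-wide crawl: {web_rate:.0f} URLs/sec")
```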

Meeting these goals is our focus for the next 2 months, so that starting from January, we’ll be in a great position to provide significantly more up-to-date data to our customers.
