What is data scraping?

Why you need it and how to do it


What is data scraping?

Data scraping (or web scraping) is the process of collecting information from a publicly accessible website into a local file/database.
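
To make that concrete, here's a minimal sketch of what a scraper does, written in Python with the requests and BeautifulSoup libraries. The URL and the CSS class names are hypothetical placeholders, not a real site:

    import csv

    import requests
    from bs4 import BeautifulSoup

    # Fetch the page (hypothetical URL) and fail loudly on an HTTP error.
    response = requests.get("https://example.com/products")
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Pull each product's name and price into a local CSV file.
    with open("products.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "price"])
        for item in soup.select(".product"):  # hypothetical class names
            name = item.select_one(".name")
            price = item.select_one(".price")
            if name and price:
                writer.writerow([name.get_text(strip=True), price.get_text(strip=True)])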

Why would I use data scraping?

  • Gives you your own copy of data that may change regularly, allowing you to look back at it as it stood at a given point in time.
  • Allows you to gather disparate data from multiple sources and combine it into something useful to you.
  • Expedites research, doing the hard work for you of visiting hundreds or thousands of web pages to retrieve data.
  • Facilitates moving data from one format to another, or from one system to another.

What would I use data scraping for?

  • Research for web content or business intelligence (BI).
  • Retrieving prices for travel booking or price comparison services.
  • Finding sales leads or for market research.
  • Sending product data from an eCommerce site to an online vendor.
  • Identifying trends in your customers via social media.
  • Building a database of historical data for analysis purposes.

Who uses data scraping?

A huge number of companies currently use web scraping to support their business. One of the largest and best-known is Google: they scrape every site they can to find links to other sites, keywords and all manner of other data, which they then index to drive their search system. Walmart were scraping Amazon millions of times a day to price match up until January 2017 (source). Bentley University scraped Kickstarter to analyse what makes a successful campaign (source).

Are they evil?

No. Well, at least not for using data scraping. Data scraping can obviously be used for immoral or illegal purposes; unfortunately, this is true of most technology. At Chattering Monkey, we will only scrape data that is publicly available on the internet, and we will not scrape lists of contact details for the purposes of selling them on to spammers or scammers. This helps keep you (and us) on the right side of GDPR.

How do I scrape data?

Excel

If you're using the Windows version, the easiest way to do it yourself is Excel's built-in web query. Excel is an excellent solution if the data you need is displayed on the web page as a table, is already in a format that is useful to you, and you have the resources to run the scrape manually.
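
If you'd rather script the same "grab the table" idea, pandas can do roughly what Excel's web query does. A short sketch, assuming the pandas and lxml libraries are installed and that the (hypothetical) page contains at least one HTML table:

    import pandas as pd

    # read_html returns a list of DataFrames, one per <table> on the page.
    tables = pd.read_html("https://example.com/exchange-rates")  # hypothetical URL
    rates = tables[0]  # take the first table found
    rates.to_csv("rates.csv", index=False)  # save it locally, Excel-style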

Browser plugins

There are various browser plugins that will allow you to scrape data from a web page to a file. Some are free, some allow a certain number of scrapes before you have to pay, and some are fully fledged commercial products. One we're familiar with and like is Data Scraper. Plugins generally make it much easier to select the data you wish to retrieve than Excel does, and they often offer more advanced features. However, much like Excel, the scraping must be run manually, and the processing/cleaning of the data is very limited.

Desktop Software

It is also possible to buy scraping software that you configure and run on your own machine; one such example is WebHarvy. This class of tool offers far more advanced features, such as scheduling and intelligent scraping, and provides the ability to scrape anonymously (via proxy). It's a good option if you can perform all the data cleaning, normalisation and analysis yourself. However, it is only possible to run one scrape at a time, which means it will be slow to retrieve hundreds or thousands of pages.

Disadvantages

Cost

The main issue with any of the three approaches discussed above is unmanaged cost. Even using a free tool (a Chrome plugin, or Excel if you already have a licence) involves learning how to configure the software to retrieve the data you want. After this, you still need to process the data into a format that is usable for you; this will require either a manual process performed after each scrape or the development of software to perform the task for you. Also, web pages change regularly (trust us on this one!). Every time a web page changes structure or styling, you will probably have to reconfigure your scraper, and you won't know the page has changed until strange-looking data starts turning up.
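
One cheap mitigation is to build basic sanity checks into whatever processes your scraped output, so a structural change on the page shows up as a loud failure rather than as quietly odd data. A minimal sketch in Python (the field names and threshold are hypothetical):

    def validate_scrape(rows, min_rows=1):
        """Fail loudly if a scrape looks wrong, rather than passing odd data along."""
        if len(rows) < min_rows:
            raise ValueError(f"only {len(rows)} rows scraped; page layout may have changed")
        for row in rows:
            if not row.get("name") or not row.get("price"):  # hypothetical fields
                raise ValueError(f"missing fields in {row!r}; selectors may need updating")

    # Example: this passes; an empty list or a row without a price would raise.
    validate_scrape([{"name": "Widget", "price": "9.99"}])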

Security

The first two methods above assume that the websites you are scraping are not going to block you after you've hit them hundreds of times. WebHarvy allows you to use any number of proxy servers (what is a proxy server?). This makes each page request look like it's coming from another location, reducing the chance that you'll be blocked. However, this involves further cost, as you'll need to sign up to a decent proxy or VPN service (VPN service reviews), plus further configuration and maintenance.
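
For context, routing requests through a proxy is a one-line change in most HTTP libraries. A sketch in Python with the requests library, using a hypothetical proxy address of the kind a paid proxy service would supply:

    import requests

    # Hypothetical proxy address and credentials from a proxy provider.
    proxies = {
        "http": "http://user:pass@proxy.example.com:8080",
        "https": "http://user:pass@proxy.example.com:8080",
    }

    # The target site sees the proxy's IP address rather than yours.
    response = requests.get("https://example.com/products", proxies=proxies, timeout=30)
    print(response.status_code)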

Ask us!

You could ask us to do it for you. We've already gone through the pain: we've worked out how to deal with pages that change regularly and how to avoid being blocked from sites. (One particular site, large enough to employ a very highly educated team to discourage the scraping of data in the public domain, is a constant thorn in our side!) We also have the tools, knowledge and experience to clean, normalise and analyse your data, so it gets to you ready to use with no headaches.

All you have to do is tell us which data you’d like and we’ll get it to you.