Author
Yauhen Zaremba
Director of Demand Generation

Yauhen is the Director of Demand Generation at PandaDoc. He’s been a marketer for 10+ years, and for the last five years, he’s been entirely focused on the electronic signature, proposal, JPEG to PDF converter, and document management markets. Yauhen has experience speaking at niche conferences where he enjoys sharing his expertise with other curious marketers. And in his spare time, he is an avid fisherman and takes nearly 20 fishing trips every year.

Web Scraping And GDPR: A Look Into The Legality of Gathering Website Data

There’s an awful lot of data on the internet. Always good to start with a genuinely profound insight, don’t you think?

Anyway, time was, a company wanting to avail itself of large chunks of data could just grab a massive tranche with something close to reckless abandon. These days, with data protection being very much de rigueur, careless data acquisition can get you into all kinds of trouble.

Let’s take a look at what the technique known as ‘web scraping’ entails and how we can use it in the era of data protection.

Web Scraping - What Is It?

Simply, web scraping is the process of getting data from a website. There are many reasons why one might do it, from acquiring personal details to just copying some content you want to refer to into Word (perhaps with a view to converting Word to PDF online later).

It can be done manually with copy and paste, but that’s rare. The main reason someone might use a manual method is that it can circumvent automated defense techniques.

The overwhelming majority of web scraping is automatic. Here are some of the techniques by which this is done.

Google Sheets

This is a popular technique for the obvious reason that Google is so widely used. Once in Sheets, the user can deploy the IMPORTXML(url, xpath_query) function to scrape data from a site. The really handy thing about Google Sheets is that you can use it to check if your own website is scrape-proof.

Parsing

This technique comes in two flavors: HTML parsing and DOM parsing. The former reads a page’s raw markup to glean information like text and links, and it’s fast. The latter works on the tree of nodes a browser builds from that markup (the Document Object Model) to harvest data on page structure, such as the layout and content of nodes. You can then use XPath to pick out exactly the nodes you want, up to and including whole web pages.
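To make HTML parsing a bit more concrete, here is a minimal sketch using Python's standard-library html.parser module. The sample page, the class name LinkAndTextParser, and the idea of collecting only links and text are all assumptions made for illustration; a real scraper would fetch the page first and would likely use a more forgiving library.

```python
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects link targets and visible text fragments from HTML."""

    def __init__(self):
        super().__init__()
        self.links = []  # href values found on <a> tags
        self.text = []   # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


# Hypothetical sample page; no network request is made.
SAMPLE_HTML = """
<html><body>
  <h1>Product list</h1>
  <p>See our <a href="/pricing">pricing</a> and <a href="/docs">docs</a>.</p>
</body></html>
"""

parser = LinkAndTextParser()
parser.feed(SAMPLE_HTML)
parser.close()  # flush anything still buffered

print(parser.links)
print(parser.text)
```

The parser never looks at page structure beyond tag names, which is what keeps this approach quick and also what limits it.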

Vertical Aggregation

This method is reserved for companies that can bring high levels of computing power to bear on the task. These agents target specific verticals, and bots are dispatched to round up all the pertinent data. The bots’ effectiveness is then measured based on the quality of the data mined.

XPath

Also known as XML Path Language, this method works on XML documents to discover data nodes. As mentioned above, when used in conjunction with DOM parsing, XPath can be a very powerful technique for acquiring whole web pages and publishing them on another site.
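As a rough illustration of XPath at work, the sketch below runs two path expressions over a small XML document with Python's standard-library xml.etree.ElementTree. Bear in mind that ElementTree supports only a subset of XPath, and that the catalog document and its element names are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document standing in for scraped content.
SAMPLE_XML = """
<catalog>
  <product category="software">
    <name>Editor</name>
    <price>80</price>
  </product>
  <product category="hardware">
    <name>Scanner</name>
    <price>120</price>
  </product>
</catalog>
"""

root = ET.fromstring(SAMPLE_XML)

# ".//product" selects every <product> node at any depth;
# "[@category='software']" keeps only nodes with that attribute value;
# "/name" then steps down to the child <name> node.
software_names = [
    node.text
    for node in root.findall(".//product[@category='software']/name")
]

# A simpler path: every <price> node anywhere in the document.
all_prices = [float(node.text) for node in root.findall(".//price")]

print(software_names)
print(all_prices)
```

The path expression does the work here: changing the attribute filter or the final step selects a different slice of the document without touching the rest of the code.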

So, these are the main means by which we can perform web scraping. Let’s turn now to the data protection climate.

General Data Protection Regulation (GDPR)

This is the legislation that’s given data professionals across the globe some serious headaches. In the interests of fairness, it should be added that GDPR has also delivered much-needed data privacy and immeasurable peace of mind to individuals and companies. But, if you’re in the data business, you’ll have encountered some major restrictions, courtesy of GDPR.

Broadly, GDPR sets out the rules by which the personal data of people in the EU can be handled, with the UK applying its own near-identical version.

What are the main tenets? Your data acquisition has to be lawful, fair, and transparent. There has to be a clear purpose to it. Data acquisition levels should be just sufficient for this purpose. The data should be accurate. It should be kept only for as long as genuinely needed. Lastly, the data handler should assume responsibility for the security of that data. 
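Two of those tenets, collecting just enough data and keeping it only as long as needed, translate fairly naturally into code. The sketch below is one possible way to apply them to a scraped record. The field names, the 90-day retention period, and the function names are assumptions chosen for illustration, not figures taken from the regulation.

```python
from datetime import date, timedelta

# Assumption: the stated purpose only needs these two fields.
FIELDS_NEEDED = {"product", "rating"}
# Assumption: an illustrative retention period, not a legal figure.
RETENTION = timedelta(days=90)


def minimise(record, collected_on):
    """Keep only the needed fields and stamp a deletion date."""
    kept = {k: v for k, v in record.items() if k in FIELDS_NEEDED}
    kept["delete_after"] = collected_on + RETENTION
    return kept


def is_expired(record, today):
    """True once the record has outlived its retention period."""
    return today > record["delete_after"]


# Hypothetical scraped review containing a field we do not need.
scraped = {
    "product": "Scanner",
    "rating": 4,
    "reviewer_email": "a@example.com",
}
stored = minimise(scraped, collected_on=date(2024, 1, 1))
print(stored)
```

Dropping unneeded fields at the point of collection means the email address in this example is never stored at all, which is simpler than securing it afterwards.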

Okay, that’s quite a lot to take on board. However, probably the main GDPR web scraping regulation takeaway is that nobody can assume permission for data acquisition and processing. It has to be explicit. 

Why does this matter? Because when a web scrape involves the personally identifiable information (PII) of people in the EU or UK, you are prohibited from doing anything with that data unless you have a lawful basis for it, the best known being permission from those individuals (see below). Acquiring permission makes what was a fairly straightforward process a singularly tricky one.

Let’s now turn to what you can do to ensure you don’t fall foul of the GDPR’s provisions.

GDPR-Friendly Web Scraping

Here are the ways you can justify personal data scraping.

1. Consent

So, this is the means mentioned above. If you have gained consent from the individual to carry out that specific procedure, you’re good to go. It obviously speeds things up to have an online authorization, so it pays to know how to create an electronic signature in Word, for instance. 

It has to be said, though, that this can be very labor-intensive and can therefore slow the process to a crawl. The good news is that there are other ways, but they come with provisos.

2. Contract

If you have a contract with the individual and part of that contract requires you to process their data, then you are okay to proceed with a scrape, as long as your need to access data is detailed clearly in the contract’s terms.

3. Legal Obligation

If data access is necessary in order to satisfy a legal obligation, then you can conduct a web scrape. But you should inform the subject in this case. 

4. Vital Interests

If there’s an overwhelming need to access data, say, to save somebody’s life, then you may be able to claim justification in your web scraping. You may need a good lawyer, however. 

5. Public Interest

If access is demonstrably in the public interest or part and parcel of your duties as a public official, then you may be justified in performing a scrape. Again, you should inform the subject in this case. 

6. Legitimate Interest

This one’s a gray area. If you can demonstrate that your legitimate interests were served by data access, then you may be okay to scrape. You will have to take great pains to define the business processes involved in order to provide a complete picture of the need.

However, if it can be shown that the individual’s fundamental rights or interests were poorly served by this action, then you could be in trouble. Best course? Lawyer up!
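One way to keep these six justifications front of mind is to make a scraping job refuse to run on personal data until a basis has been recorded. The sketch below is a hypothetical gate along those lines: the LawfulBasis enum mirrors the six headings above, while the may_scrape function and its parameters are invented for illustration. It checks only that a basis and a written reason exist, not that they would stand up legally.

```python
from enum import Enum


class LawfulBasis(Enum):
    """The six justifications listed above."""
    CONSENT = "consent"
    CONTRACT = "contract"
    LEGAL_OBLIGATION = "legal obligation"
    VITAL_INTERESTS = "vital interests"
    PUBLIC_INTEREST = "public interest"
    LEGITIMATE_INTEREST = "legitimate interest"


def may_scrape(contains_personal_data, basis=None, justification=""):
    """Return True only if personal data comes with a recorded basis.

    Data with no personal information passes straight through.
    Personal data needs both a LawfulBasis and a non-empty written
    justification, so the decision can be shown to someone later.
    """
    if not contains_personal_data:
        return True
    return isinstance(basis, LawfulBasis) and bool(justification.strip())


print(may_scrape(False))
print(may_scrape(True))
print(may_scrape(True, LawfulBasis.CONTRACT, "Clause 4 covers order data"))
```

Requiring the justification as text is a deliberate choice: it leaves a paper trail of why each scrape was considered acceptable.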

Questions to Ask Yourself

Before you embark on a web scrape, ask yourself the following questions to ensure you stay legit.

Am I Scraping PII?

PII means any data that can be used to identify an individual, either directly or indirectly. Name is an obvious one, as are address, payment card details, email address, etc. It could be an image, so if you’re looking to grab, say, a HEIC image with a view to doing something like converting HEIC to PDF, you may need to check that it doesn’t contain any PII.

If the data doesn't include any PII, then you're in the clear.
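A quick first-pass check for the more obvious kinds of PII can be sketched with regular expressions. The example below looks only for email addresses and IPv4 addresses, and both patterns are deliberately simple assumptions rather than complete validators. An empty result does not prove the text is free of PII, since names, postal addresses, and images need other approaches.

```python
import re

# Simplified patterns for illustration; neither is a full validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_possible_pii(text):
    """Return anything in the text that looks like an email or IPv4 address."""
    return {
        "emails": EMAIL_RE.findall(text),
        "ipv4": IPV4_RE.findall(text),
    }


# Hypothetical scraped review text.
sample = (
    "Great scanner! Contact me at jane.doe@example.com "
    "(posted from 203.0.113.7)."
)
found = find_possible_pii(sample)
print(found)
```

Anything this flags deserves a closer look before the data goes any further.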

Where Am I Scraping?

GDPR covers the EU and the UK, plus a few other territories such as Iceland and Norway. Here’s the kicker, though. It doesn’t matter if the company is or isn’t based in that area. It’s the action and whom you’re doing it to that counts.

In other words, if you’re carrying out an action in the US that results in the acquisition and processing of the PII of someone living in France, you’d better be GDPR compliant.

Do I Have a Lawful Reason?

See those justifications above.

Is the Data Sensitive?

Just to complicate matters, GDPR classes certain data as sensitive. Examples include ethnic origin and political opinions. If you’re intent on scraping for this data, then you’d better have your consent absolutely watertight, along with your justification for using it.

Have I Checked the IP Consent?

IP addresses are considered by GDPR to be PII, so if you’re using any EU residential proxies while you’re doing your scraping, the owner of that proxy has to give consent.

Some Final Tips

Web scraping sure can be challenging. Here are some final tips to help you ensure you're getting the right data in an ethical and efficient manner.

  • Beware of misinformation

Some hold that if PII is publicly available, it’s fine to scrape it for whatever purpose you have in mind.

This is not so. An individual submitting their details onto, say, a review platform has done so just with a specific activity in mind. They haven’t given consent for those details to be grabbed by another party and used for a product promotion mailout.

Be careful also regarding intellectual property. Again, just because something is reachable, it doesn’t mean you can go ahead and grab it. 

  • If in doubt, inform

Data subjects, in most cases, need to be informed about data usage. If uncertain about this, it’s good practice to err on the side of informing. So, even if you think you have a vital interest in using the data, issue the subject with an update. It can even be retrospective if time is demonstrably an issue. 

  • Be responsive

Data Subject Access Requests (DSARs) oblige you to reveal what you hold on an individual when asked. Do this in a timely fashion; GDPR generally allows one month.

If there’s a breach, report it. No good will come of trying to cover up. If the breach is likely to be a threat to an individual’s fundamental privacy rights, you have 72 hours from becoming aware of it to inform the authorities.
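The three-day (72-hour) reporting window is easy to lose track of in the middle of an incident, so it can help to compute the deadline the moment a breach is discovered. This is a minimal sketch using only the standard library. The function names and the sample timestamp are assumptions for illustration, and the clock is assumed to start when you become aware of the breach.

```python
from datetime import datetime, timedelta, timezone

# Three days, expressed in hours.
NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(became_aware_at):
    """Latest moment to notify the authorities."""
    return became_aware_at + NOTIFICATION_WINDOW


def hours_remaining(became_aware_at, now):
    """Hours left before the deadline, never below zero."""
    remaining = notification_deadline(became_aware_at) - now
    return max(remaining.total_seconds() / 3600, 0.0)


# Hypothetical discovery time, kept in UTC to avoid timezone mix-ups.
aware = datetime(2024, 5, 6, 9, 30, tzinfo=timezone.utc)
deadline = notification_deadline(aware)
print(deadline.isoformat())
print(hours_remaining(aware, aware + timedelta(hours=24)))
```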

Happy Scraping

So, now you know what to do. It’s not the nightmare it could be. You simply have to be sure of what you’re scraping and why and take steps to secure permission where necessary. More information regarding GDPR is all over the net, should you need more details. 

But you could do much worse than remember that data is not legitimately yours just because you can access it. There are rules out there, so comply with them, and you’ll be in good shape to scrape.

In short, scrape straight to scrape safe!