Web Scraping for AI Training: Can it Comply with GDPR?

Unsure if scraping public data for AI training is GDPR-compliant? EU regulators are raising concerns. Learn best practices to minimize data protection risks.

With the rising use of generative AI tools, it is becoming increasingly common to use web scraping to collect personal data from websites and feed it into model training. But are such practices always GDPR-compliant in the EU?

Personal data does not lose its protection under GDPR simply by being published online. This implies that when collecting publicly available data, it is essential to comply with GDPR principles, many of which are currently being challenged in cases of web scraping.

Recently, the European Data Protection Supervisor, with its Guidelines on Generative AI: embracing opportunities, protecting people, expressed concerns around scraping, especially about:

Individuals’ loss of control over their data for processing that goes beyond their expectations
Processing for purposes other than the one for which the data was originally published
Compliance with data minimization and accuracy principles

The Dutch data protection authority, in its Guidelines on scraping by private organizations and individuals (in Dutch), also highlighted a number of data protection risks connected with web scraping. However, the Dutch guidelines addressed only the lawfulness of scraping and the elements that may affect the legitimate interest assessment required for most scraping activity.

Most significantly, the Dutch guidelines conclude that only targeted scraping—i.e., very limited in sources and purposes—is compatible with GDPR. However, they leave it up to controllers to assess the lawfulness and viability of such processing on a case-by-case basis. Despite the limited scope of these guidelines, they mark one of the first attempts by EU data protection authorities to offer practical guidance on ensuring that scraping activities comply with EU legislation.


Similarly, the EDPB's recent Report of the work undertaken by the ChatGPT Taskforce, although focused on OpenAI's data processing, points to technical measures that may reduce the impact of web scraping on individuals. These include “defining precise collection criteria and ensuring that certain data categories are not collected or that certain sources (such as public social media profiles) are excluded from data collection” and adopting measures to “delete or anonymize personal data […] before the training stage.” The report also seems to recognize that the characteristics of scraping justify providing the privacy notice only via public means (Art. 14(5)(b) GDPR), provided that the controller duly documents the reasons for doing so.
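To illustrate what a measure to "delete or anonymize personal data before the training stage" might look like in its simplest form, here is a minimal Python sketch. The patterns, placeholder tokens, and the `redact` function name are assumptions made for illustration and do not come from the EDPB report. Pattern matching of this kind catches only obvious identifiers such as email addresses and phone numbers, so it should be read as one step in a pipeline rather than as anonymization in the GDPR sense.

```python
import re

# Illustrative patterns only (assumptions): they catch common email and
# phone formats and will miss names, addresses, and other identifiers.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


if __name__ == "__main__":
    print(redact("Contact jane.doe@example.org or +31 20 123 4567."))
```

The phone pattern is deliberately broad and may also replace other long digit sequences; in a real pipeline that trade-off would need to be assessed against the data minimization and accuracy principles discussed above.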

Analogous recommendations have been put forward by the French data protection authority (CNIL) in its latest Factsheet on web scraping and the legitimate interest (in French). In particular, the CNIL reiterated the importance of defining collection criteria (excluding sensitive data and, possibly, any personal data) and promptly deleting/anonymizing unnecessary personal data right after their collection. Amongst collection criteria, the CNIL, in line with the previous interpretations, suggested the following:

Exclude from scraping (by default) certain websites that include sensitive or large volumes of data
Exclude from scraping websites that prohibit scraping or re-use of data for AI training purposes (e.g., by using robots.txt or ai.txt files)
Limit collection to freely accessible data (i.e., not reserved areas) and manifestly made public by the user (e.g., by selecting the appropriate settings in their social network profiles)
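The first two criteria in the list above lend themselves to an automated pre-collection check. The following Python sketch is one possible shape for it, under stated assumptions: the excluded domains, the `example-ai-bot` user agent, and the `may_collect` function are hypothetical, and the robots.txt content is passed in as a string so the example runs without network access. It uses the standard library's `urllib.robotparser`. The third criterion (data manifestly made public by the user) depends on platform settings and is not covered here.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical default exclusion list: placeholder domains standing in for
# sites that hold sensitive data or large volumes of personal data.
EXCLUDED_DOMAINS = frozenset({"health-forum.example", "social.example"})


def may_collect(url, robots_txt, user_agent="example-ai-bot",
                excluded_domains=EXCLUDED_DOMAINS):
    """Return True only if the URL passes both collection criteria."""
    host = (urlparse(url).hostname or "").lower()

    # Criterion 1: skip excluded sites, including their subdomains.
    if any(host == d or host.endswith("." + d) for d in excluded_domains):
        return False

    # Criterion 2: honour the site's robots.txt rules for this user agent.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = (
        "User-agent: example-ai-bot\n"
        "Disallow: /\n"
        "\n"
        "User-agent: *\n"
        "Disallow: /private/\n"
    )
    print(may_collect("https://news.example/blog/post", robots))
```

In this usage example the site blocks the hypothetical AI crawler entirely while allowing other agents outside `/private/`, so the check refuses collection for `example-ai-bot`.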

It also introduced some interesting suggestions to foster transparency of scraping practices as well as data subjects’ control over their data, including:

Setting up a blacklist, managed by the scraper, to allow data subjects to object to the collection of their data from certain websites/platforms
Allowing data subjects to object to the processing on a discretionary basis
Disseminating information about the scraper’s data collection practices and data subjects’ rights as widely as possible
Preventing any cross-checking of the data based on individual identifiers
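As a rough illustration of the first suggestion in this list, a scraper-managed objection list could be consulted before any page is collected. The sketch below is an assumption-laden example rather than anything prescribed by the CNIL: the `ObjectionRegister` class, its methods, and the idea of keying objections by profile URL are all hypothetical design choices.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class ObjectionRegister:
    """Hypothetical register of profile URLs whose owners objected to collection."""

    # Maps a source host to the set of profile paths that must be skipped.
    _entries: dict = field(default_factory=dict)

    @staticmethod
    def _key(url):
        parts = urlparse(url)
        return (parts.hostname or "").lower(), parts.path.rstrip("/")

    def add(self, profile_url):
        """Record an objection for a profile and everything beneath it."""
        host, path = self._key(profile_url)
        self._entries.setdefault(host, set()).add(path)

    def is_blocked(self, url):
        """Return True if the URL is an objected profile or a page under it."""
        host, path = self._key(url)
        return any(
            path == p or path.startswith(p + "/")
            for p in self._entries.get(host, ())
        )


if __name__ == "__main__":
    register = ObjectionRegister()
    register.add("https://social.example/users/alice/")
    print(register.is_blocked("https://social.example/users/alice/posts/1"))
```

Keying the register by source URL rather than by name or email keeps the check from requiring additional identifiers about the data subject, which sits more comfortably with the suggestion to prevent cross-checking based on individual identifiers.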


Even though a holistic GDPR approach towards scraping is yet to be defined, more guidance from authorities on GDPR and scraping is expected to follow shortly, especially after the conclusion of investigations on OpenAI data practices. In the meantime, without a clear position, any scraping activity will inherently trigger data protection risks, leaving the controller with the onerous need to assess its compliance with GDPR (also via a DPIA) and monitor any new developments.
