Measures to protect against web scraping for training generative AI
The practice of web scraping
Generative AI models are a type of artificial intelligence capable of creating new content such as text, images, or music. To train them, large amounts of data are required. One method to obtain this data is through web scraping, which involves extracting information from web pages.
Web scraping, also called data scraping, is a technique that uses software to extract information from websites automatically. It works much as a human visitor's browser does: the program sends requests to the website, receives HTML pages in response, and then extracts the relevant data. The process can be broken down into four steps. First, the website and the specific data sought are identified. Next, the structure of the website is analyzed to understand how the data is stored. Then a program called a scraper is developed to extract the data. Finally, the scraper is run to obtain the information.
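The extraction step can be illustrated with a minimal sketch that uses only the Python standard library. The sample page, the `LinkExtractor` class, and the `extract_links` helper are hypothetical names chosen for illustration; a real scraper would first download the page (for example with `urllib.request`) rather than read it from a string.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href and visible text of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []            # list of (href, text) tuples
        self._current_href = None  # href of the <a> tag being read, if any
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._current_text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._current_text).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html):
    """Parse an HTML string and return the links it contains."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical page content standing in for a downloaded HTML response.
SAMPLE_HTML = """
<html><body>
  <a href="/products/1">Widget</a>
  <a href="mailto:jane@example.com">Contact Jane</a>
</body></html>
"""

print(extract_links(SAMPLE_HTML))
# [('/products/1', 'Widget'), ('mailto:jane@example.com', 'Contact Jane')]
```

Note that the second link exposes an email address: even a scraper aimed at product data can pick up personal information along the way.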
Web scraping has a wide range of applications, such as market research to gather data on prices, products, and competitors; web data analysis to gain insights into user behavior on a website; and training generative AI. However, the technique can also collect personal information, so although it is useful, it can lead to violations of privacy and data protection laws if it is not managed properly.
Data protection
Training generative AI models requires large volumes of data. Acquiring that data through web scraping can conflict with privacy, because the technique can collect information attributable to an identified or identifiable individual, which makes it a data protection issue.
In many instances, the data collected identifies individuals, for example through names, email addresses, or phone numbers. If this personal data is used to train AI models that then generate content containing identifiable personal information, this may constitute a data protection violation.
A significant example of this issue is the €20 million fine imposed by the Italian Data Protection Authority (the Garante) on Clearview AI for using web scraping to collect personal information from users without consent.
Regulation for generative AI
This issue has led the Italian Data Protection Authority to publish a document outlining a set of measures that website operators should take to prevent web scraping of potential personal data on their websites. These measures are designed to ensure compliance with data protection laws and to protect the privacy of individuals whose data might be scraped.
In this regard, and in compliance with Article 5 of the GDPR, the measures proposed by the Garante to prevent web scraping are as follows:
- Restrict access to specific areas through prior registration. This measure allows controlling access to information without the need for excessive data processing, thus eliminating its public availability. By requiring users to register before accessing certain areas of a website, operators can monitor and control who accesses their data.
- Prohibit data extraction in the website's legal notice or terms of service. Although such a clause can only be enforced after the fact, it works as a special preventive measure with a deterrent effect, which distinguishes it from the previous one. If a scraper ignores the clause, the operator has a basis on which to act against it.
- Reduce network traffic and the number of requests by accepting only those that come from specific IP addresses. This curbs excessive data traffic before it occurs and reduces the likelihood of the website being targeted by scrapers.
- Limit the use of bots to curb automatic data collection. Measures such as CAPTCHA checks, a robots.txt file, or embedding the content to be protected in multimedia files can be implemented. These tools help distinguish human users from automated bots, or tell bots which content they may not collect, and so hinder unauthorized data scraping.
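As an illustration of the robots.txt measure, the sketch below uses Python's standard `urllib.robotparser` module to show how a well-behaved crawler would interpret such a file. The rules, the `is_allowed` helper, and the example URLs are assumptions made for illustration; `GPTBot` is used here only as an example of an AI crawler's user-agent token.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler entirely and keep
# every other bot out of a members-only area.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /members/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/articles/1"))        # False
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/articles/1"))      # True
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/members/profile")) # False
```

A robots.txt file only states the operator's wishes. It has no technical enforcement, so it works against crawlers that choose to honor it and is usually combined with the other measures.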
It is important to note that, as the Garante points out, these measures are not one-size-fits-all recommendations and therefore require a case-by-case analysis.
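The traffic-limiting measure listed above can be sketched as a small request filter that combines an optional IP allowlist with a per-address request limit over a sliding time window. This is an illustrative sketch under stated assumptions, not a method prescribed by the Garante: the `RequestFilter` class, its parameters, and the example IP addresses are all hypothetical, and a production site would normally apply such limits in its web server, firewall, or CDN.

```python
import time
from collections import defaultdict, deque


class RequestFilter:
    """Accepts or rejects requests by client IP (hypothetical helper).

    allowed_ips: optional set of addresses; if given, all others are rejected.
    max_requests / window_seconds: per-IP limit within a sliding window.
    clock: function returning the current time in seconds (injectable for tests).
    """

    def __init__(self, max_requests, window_seconds,
                 allowed_ips=None, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.allowed_ips = set(allowed_ips) if allowed_ips else None
        self.clock = clock
        self._history = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip):
        # Reject addresses outside the allowlist, if one is configured.
        if self.allowed_ips is not None and ip not in self.allowed_ips:
            return False

        now = self.clock()
        history = self._history[ip]

        # Forget requests that have fallen out of the time window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()

        # Reject if this address has already used up its quota.
        if len(history) >= self.max_requests:
            return False

        history.append(now)
        return True


# Example: at most 2 requests per 10 seconds from any single address.
limiter = RequestFilter(max_requests=2, window_seconds=10)
for attempt in range(3):
    print(attempt, limiter.allow("203.0.113.5"))
# 0 True
# 1 True
# 2 False
```

A sliding window is used here because it is simple to reason about; the limits themselves would have to be tuned to each site's legitimate traffic, in line with the case-by-case analysis the Garante calls for.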