{"id":16119,"date":"2024-07-08T06:00:00","date_gmt":"2024-07-08T06:00:00","guid":{"rendered":"https:\/\/letslaw.es\/?p=16119"},"modified":"2024-07-03T14:51:53","modified_gmt":"2024-07-03T14:51:53","slug":"protection-web-scraping","status":"publish","type":"post","link":"https:\/\/letslaw.es\/en\/protection-web-scraping\/","title":{"rendered":"Measures to protect against web scraping for training generative AI"},"content":{"rendered":"<h2><span style=\"font-weight: 400;\">The practice of web scraping<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Generative AI models are a type of artificial intelligence capable of creating new content such as text, images, or music. To train them, large amounts of data are required. One method to obtain this data is through <strong>web scraping, which involves extracting information from web pages<\/strong>.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Data scraping is a technique that employs software to automatically extract information from websites. It functions similarly to how a human user would: the program sends requests to the website, receives HTML pages in response, and then extracts the relevant data. This process can be broken down into several steps: first, the website and the specific data sought are identified. Next, the structure of the website is analyzed to understand how the data is stored. After this, a computer program called a scraper is developed to extract the data. Finally, the scraper is run to obtain the information.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Data scraping has a wide range of applications, such as market research to gather data on prices, products, and competitors; web data analysis to gain insights into user behavior on a website; and training generative AI. However, this technique can collect personal information, raising a data protection issue. 
The practice of web scraping, although useful, <strong>can lead to violations of privacy and data protection laws if not managed properly<\/strong>.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Data protection<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Training generative AI models, such as those used for creating text, images, or music, requires large volumes of data. Using web scraping to acquire this data creates a <strong>conflict with privacy because this technique can collect information that can be attributed to an identified or identifiable individual<\/strong>, resulting in a <a title=\"data protection\" href=\"https:\/\/letslaw.es\/en\/category\/data-protection\/\">data protection<\/a> issue.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In many instances, data that identifies individuals, such as names, email addresses, or phone numbers, can be collected. If this personal data is used to train AI models that generate content including identifiable personal information, it would constitute a <a title=\"data protection violation\" href=\"https:\/\/letslaw.es\/en\/data-protection-and-its-connection-to-artificial-intelligence\/\">data protection violation<\/a>.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A significant example of this issue is the \u20ac20 million fine imposed by the Italian Data Protection Authority, the Garante, on Clearview AI for using web scraping to collect individuals' personal information without their consent.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Regulation for generative AI<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">This issue has led the Italian Data Protection Authority to publish a document outlining a set of measures that website operators should take to prevent the scraping of personal data published on their websites. 
These measures are designed to ensure compliance with data protection laws and to protect the privacy of individuals whose data might be scraped.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this regard, and in compliance with Article 5 of the GDPR, the measures proposed by the Garante to prevent web scraping are as follows:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>Restrict access to specific areas through prior registration<\/strong>. This measure removes the data from public availability and lets operators control access to it without excessive data processing. By requiring users to register before accessing certain areas of a website, operators can monitor and control who accesses their data.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>Prohibit data extraction through legal notices<\/strong>. Unlike the previous measure, this one operates only after the fact: it is a special preventive measure whose value lies in its deterrent effect. A clause in the legal notice or terms of service that expressly prohibits scraping gives the operator grounds to act against anyone who breaches it.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>Reduce network traffic and the number of requests<\/strong> by selecting only those coming from specific IP addresses. This curbs excessive data traffic before it occurs. By limiting access to specific IP addresses, websites can reduce the likelihood of being targeted by scrapers.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>Limit the use of bots to curb automatic data collection<\/strong>. Options include adding CAPTCHA checks, using robots.txt, or embedding the content to be protected in multimedia files. 
These tools help distinguish human users from automated bots and make unauthorized data scraping harder.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">It is important to note that, as the Garante points out, these measures are not one-size-fits-all recommendations; each operator must assess case by case which of them are appropriate.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Generative AI models are a type of artificial intelligence capable of creating new content such as text, images, or music.<\/p>\n","protected":false},"author":72,"featured_media":16117,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[258],"tags":[],"class_list":["post-16119","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-digital-law"],"_links":{"self":[{"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/posts\/16119","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/users\/72"}],"replies":[{"embeddable":true,"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/comments?post=16119"}],"version-history":[{"count":4,"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/posts\/16119\/revisions"}],"predecessor-version":[{"id":16137,"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/posts\/16119\/revisions\/16137"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/media\/16117"}],"wp:attachment":[{"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/media?parent=16119"}],"wp:
term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/categories?post=16119"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/letslaw.es\/en\/wp-json\/wp\/v2\/tags?post=16119"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}