Ethical Data Scraping for Research – Expert Workshop Held in Amsterdam

A unique, expert-led workshop on ethical data scraping was organized by Professor Niva Elkin-Koren and Dr. Maayan Perel and hosted by the Shamgar Center of Digital Law and Innovation, Tel Aviv University. The workshop was made possible by the generous support of the Right to Research in International Copyright Law coalition at American University, especially Professor Sean Flynn, Director of the Program on Information Justice and Intellectual Property (PIJIP).

An interdisciplinary group of information law experts gathered in Amsterdam’s beautiful Volks hotel on July 2, 2025, to discuss data scraping for research and innovation and its ethical boundaries. The event aligned with the agenda of the Standing Committee on Copyright and Related Rights (SCCR), which promotes public interest strategies, coordinated action, and research, and seeks to inform public policy on legal exceptions and limitations for researchers.

Data scraping is an essential research tool for academics and scientists across a wide range of disciplines. It is also critical for training artificial intelligence (AI) models and developing innovative research methodologies. The legal boundaries of data scraping attract considerable attention, not only from academics but also from policymakers, governments, courts, technology companies, and data providers worldwide.

The boundaries of ethical data scraping remain unclear; they often depend on the type of data being scraped, the technologies being used, the purpose of scraping, and the applicable legal framework. Consequently, researchers are left to navigate potential legal risks and shifting technological barriers set by tech giants such as Cloudflare, which recently adopted a permission-based approach to data scraping. Researchers may therefore be deterred from engaging in lawful data scraping, at the cost of forgoing research that can serve the public interest.

Moderated by Dr. Maayan Perel and Professor Eldar Haber, the workshop aimed to bring greater clarity to what ethical data scraping is and should be. The workshop applied practical and technical insights from real-world data scraping, analyzed the legal implications of various transatlantic approaches, and proposed guidelines for promoting ethical data scraping for research and development.

To obtain a better understanding of how data scraping models work in practice, participants explored a test case model from Bright Data, an international data scraping company, whose model was also discussed in recent litigation with X and Meta. In a stimulating presentation, Bright Data representatives described their publicly available data scraping technology, elaborated on their ethical policies, and presented their “data for good” initiative, which offers scraping opportunities for researchers as well as other stakeholders.

To encourage a productive dialogue between academic and business participants, the discussion followed a “red teaming” approach. Red teaming, a concept we adapted from the cybersecurity realm, helps organizations proactively identify weaknesses and strengthen their security posture before actual attacks occur. Applying this critical approach, the participants identified potential legal challenges in Bright Data’s test case model from various perspectives, including intellectual property law, competition law, privacy law, and data protection law, while also identifying points of legal tension between the US and EU frameworks.

The issues highlighted included the application of copyright law to the copying and storage of information; competition law questions arising from dominant market actors’ ability to adjust behavior and match prices; and the scope of privacy protection for personal information that data providers voluntarily make publicly accessible.

Next, insights from Bright Data’s test case were used to draw broader observations about what constitutes ethical data scraping in practice, especially for AI training.

Key issues included:

Aligning AI training with the commercial/non-commercial divide governing EU copyright law;
Applying “opt-out” options in commercial cases of text and data mining (TDM);
The lack of analysis of scraping and TDM in US copyright law cases;
Applying fair use analysis to training generative AI models with pirated material following two recent US cases, Bartz v. Anthropic and Kadrey v. Meta;
Applying Section 1202 of the Digital Millennium Copyright Act (DMCA), which restricts intentionally removing or changing copyright management information, against scrapers;
The impact of the US approach on licensing regimes, authors’ incentives, and the use of paywalls;
Potential overlaps between intellectual property and contract law and the role of preemption (e.g., the cases of Ryanair in the European Union and ML Genius in the United States);
A right to research and access data, and the flawed mechanisms of Article 40 of the Digital Services Act;
The potential misuse of trade secrets to stifle freedom of expression and information, and the protection of public interest exceptions provided by Article 5 of the EU Trade Secrets Directive;
Transatlantic differences between the European Union and the United States regarding what privacy means, especially in relation to publicly available data;
The power of contracts to override rights of access and the unintended consequences of strong privacy protection in creating data enclosures.

The workshop concluded with a broader discussion of potential legal, technical, and institutional strategies to promote ethical data scraping for academic research and technological development. Participants identified the need to distinguish between questions of access to data and questions of how the data is used, as each raises different legal issues.

Key suggestions included:

Recognizing data access as an institutional challenge meriting governmental intervention and affirming the role of the state in providing access to data as a public good;
Facilitating systemic data access for research, possibly through standardized licensing;
Formulating a comprehensive framework of exceptions and limitations for researchers applicable beyond a specific legal domain;
Adopting a nuanced approach facilitated by technological solutions;
Framing “a right to research” with duties to respect and (institutionally) promote the rights of researchers;
Exploring how EU data spaces can protect data sharing in specific domains against prevalent “opt-out” frameworks;
Adopting cross-border solutions;
Exploring potential models of authors’ remuneration (e.g., lump sums, price tags, collective licensing, “scrape first, pay later” payments);
Addressing asymmetries of agency, power, and data in defining ethical data scraping;
Using social licenses to document data subjects’ preferences and reflect their expectations;
Identifying ethical scraping options, like Creative Commons for AI training;
Ensuring the enforceability of social licenses via institutional vehicles;
Developing a methodology to measure the value and impact of making data open;
Analyzing the interplay between top-down regulation of data access and open-ended norms;
Releasing the delegated act on data access under the Digital Services Act;
Fostering open data in both public and private spheres.