Case Studies

Case Studies, TDM Cases

Ethical Sourcing of African Language Data: Lanafrica and the NOODL licence

Over 2,000 African languages are spoken by approximately 1.4 billion people across the continent, a linguistic diversity that underpins African democracy, development, and cultural life. As artificial intelligence becomes central to progress in areas like healthcare, agriculture and education, new methods for collecting and sharing African language data are urgently needed—methods that support innovation while protecting community interests. In this video we look at two pioneering initiatives in the field of ethical data sourcing in Africa.

The Mining Mindset: From Extraction to Partnership

Historically, many dataset projects have adopted an extractive approach: gathering language data from communities with little return or recognition for the contributors. Professor Gloria Emezue, Research Lead at Lanafrica, diagnoses this approach: "We saw the whole idea of exploitation of data as something of a mindset. When you're out to mine something, you're actually exploiting that thing. We now say, okay, let's have a new approach, which we call data farming."

Rather than "mining" communities for linguistic resources, Lanafrica works to "farm" data collaboratively—turning village squares into living laboratories, involving contributors directly, and ensuring that partnerships support further community-driven research. A practical example is NaijaVoices: over 1,800 hours of curated audio data in Nigerian languages, sourced through relationships built on respect and transparency. For Lanafrica, research access is free, but commercial use involves agreements that channel support back to the language communities, creating a virtuous cycle of benefit and ongoing dataset growth.

Who Owns the Data—and Who Reaps the Rewards?

Behind every local dataset is a web of legal and ethical questions about ownership and benefit sharing. Dr Melissa Omino, Director at the Centre for Intellectual Property and Information Technology Law (CIPIT), Strathmore University, notes the paradox that emerges when communities contribute their voices but lose control over how their words are used: "Where do you get African languages from? Your first source would be the communities that speak the African languages, right? So you go to the community, record them, and the minute you make the recording, you have created a copyrighted work, and whoever has made the recording owns the copyright."

According to Professor Chijoke Okorie of the Data Science Law Lab at the University of Pretoria, "What ends up happening is that the communities that created these datasets end up paying for products that are built on the datasets that they created." Addressing these inequities requires new licensing models that prioritize context, impact, and fair division of benefits.

NOODL Licence: Reimagining Access and Justice

The Nwolite Obodo Open Data Licence (NOODL) breaks new ground in recognising African realities. It creates a two-tier licensing system: broad, cost-free access for African users, and negotiations or royalties for wealthy, commercial or international users. Communities remain central, not secondary. Professor Chijoke Okorie describes the ethos of NOODL: "Nwolite Obodo is Igbo… for community raising, community development. We've grown from a research group at the University of Pretoria to a research network of researchers from both Anglophone and Francophone Africa." NOODL's design puts community benefit at the forefront.

As Dr Melissa Omino explains: "If you are from a developing country, let's say you're from Brazil, and you want to use the data that is licensed under this regime, then you can use it under a Creative Commons licence. If you are a multi-million dollar tech company that wants to use the licence, then you need to negotiate with the AI developers who collected the data. And the licence also ensures that the community gets a benefit."
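To make the two-tier structure easier to follow, here is a minimal, illustrative sketch of the access logic described in the quotes above. It is not an implementation of the NOODL licence itself; the user categories and terms are simplified paraphrases for this case study.

```python
# Illustrative sketch only: a simplified encoding of the two-tier access
# logic described for the NOODL licence. Categories and terms are
# paraphrased from this case study, not taken from the licence text.

def noodl_access_terms(user_type: str) -> str:
    """Return simplified access terms for a given user category."""
    if user_type in {"african_researcher", "african_community"}:
        # Broad, cost-free access for African users.
        return "open access, no fee"
    if user_type == "developing_country_user":
        # E.g. a user from Brazil: Creative Commons-style terms.
        return "use permitted under Creative Commons-style terms"
    if user_type == "large_commercial_entity":
        # Wealthy, commercial or international users negotiate terms,
        # with benefits flowing back to the language communities.
        return "negotiated licence with community benefit-sharing"
    return "contact the data stewards to agree terms"

if __name__ == "__main__":
    for user in ["african_researcher", "developing_country_user",
                 "large_commercial_entity"]:
        print(user, "->", noodl_access_terms(user))
```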
Towards More Equitable Language Data

If African language data is simply extracted and exported, local voices risk exclusion—both from representation and from the benefits of digital innovation. By refocusing efforts on partnership and fair licensing, projects like Lanafrica and NOODL demonstrate that ethical, sustainable language technology is achievable. Their experiences may help guide how other communities and researchers engage with digital resources and cultural heritage—ensuring language data works for those who speak it.

Case Studies

Dams in the Infinite River: Limits to Copyright's Power Over the Next Generation of Generative AI Media

Can copyright effectively govern creative works in an age of infinite digital remixing? Is it possible—or even reasonable—to audit how creators use AI in their art? Should copyright law adapt to prioritize attribution and visibility over strict ownership? What new models are needed for artists to capture value in a world of mass, AI-driven content? These are the questions raised by Zachary Cooper in his presentation. Watch the video or read the transcript below.

Zachary Cooper, a researcher at the Amsterdam Law and Technology Institute (Vrije Universiteit Amsterdam), delivered a provocative address at the User Rights Conference in Geneva in June 2025 entitled "Dams for the Infinite River." Cooper's talk challenged conventional debates in copyright law by spotlighting the seismic impact of AI on media creation. Instead of focusing on traditional authorship thresholds, Cooper urged legal scholars and policy-makers to grapple with deeper issues emerging as generative AI lets anyone remix, alter, and mass-produce creative content with unprecedented ease. His research was supported by the Weizenbaum Institute in Berlin and the Copyright Society.

Throughout his presentation, Cooper highlighted the profound difficulties facing rights holders in auditing creative works crafted with AI, questioning both the feasibility and ethics of tools like watermarking. He noted that legal frameworks around the world are ill-equipped to handle the blurred boundaries between human and machine authorship, which can vary dramatically from one country—or even one creative tool—to the next. As technology enables infinitely variable, interactive content, Cooper argued that copyright law is rapidly losing its power to define, control, and protect creative output.

Cooper concluded by suggesting that copyright is akin to "a dam in an infinite river"—an increasingly obsolete barrier in a world defined by endless remix and transformation. He warned that unless legal and industry leaders embrace collective licensing models and prioritize attribution and visibility, platforms with massive network effects will continue to undercut the negotiating power of original creators. The talk raised urgent questions about the future of artistic value and legal protection in the era of generative AI.

Cleaned-Up Full Transcript

Below is the transcript, lightly edited for clarity and grammar, with filler, interjections, and stage directions (music, technical interruptions) omitted unless needed for meaning.

I'm Zach Cooper from the Amsterdam Law and Technology Institute at Vrije Universiteit Amsterdam. This research has also been sponsored and supported by the Weizenbaum Institute in Berlin and the Copyright Society. I've titled the presentation "Dams for the Infinite River." What do I mean by that? I've been trying to reframe the conversation around what I believe are the actual challenges as AI dramatically changes the ways we consume and produce media in the 21st century. Instead of focusing on longstanding debates about authorship thresholds, I argue that a collection of unspoken challenges will more fundamentally shape the issues facing rights holders over the coming decade—some already present, others just emerging.
As a cultural reference, I was encouraged to revisit Taylor Swift's music, and her work offers a useful metaphor for my argument. With new technologies, it's possible to change a song's structure, lyrics, or style almost instantly. For instance, I was able to alter one of her songs for this presentation much faster than it took us to listen to it here. That example raises a fundamental question: Now that we can so easily transform any media, what role does intellectual property play? How can it govern such a fluid world?

It's become widely accepted that pressing a button to generate content isn't the same as creating something as an artist. However, definitions differ worldwide. In China, a lengthy and detailed AI prompt might count as authorship. In the US, AI is considered a tool, but the boundaries between "human" and "AI-generated" work remain undefined. Europe has largely avoided the issue, though courts have denied copyright to works made by AI without human involvement. The problem with all these approaches is that they split copyright protection based on how much AI was used. In reality, professional creative software has countless generative and non-generative tools, and creators often use many in combination. At present, pressing one set of buttons might grant copyright; pressing another might not—yet how can we audit those choices?

When people talk about "AI," they imagine a single function, but AI can do anything: from mastering music to generating drumlines, altering instrument sounds, or even creating new genres. The spectrum of creative tools and applications is vast. And generative art isn't simply "slop" made thoughtlessly; sometimes, the act of generation itself creates entirely new forms of art. Artists like Databots stream infinitely generated music, turning the process and the model itself into the artwork. Labeling something as "AI-generated" doesn't reveal anything meaningful about the creator's relationship to the work.

Judges, copyright offices, and others cannot just ask whether someone used generative AI—they need to know exactly how it was used, down to the specifics of each prompt or button pressed. Currently, there's no way to reliably track or audit these creative decisions. The available methods are either to trust creators to self-report their practices, which is unreliable, especially for older works, or to track every creative act—a solution that invades privacy and may stifle creativity. Watermarking is frequently proposed but is technically weak; watermarks can be stripped from images or audio files, and existing protocols like C2PA are dependent on metadata, which is easy to remove. Even with robust watermarking, creators could simply recreate works without relying on AI, bypassing any AI detection entirely. Thus, the system is inconsistent and essentially unworkable. Current enforcement tries to assign rights based on the use of AI features that remain unauditable and poorly defined, never truly capturing the creator's intent or involvement. Alternatively, we accept new creative models and ask: What are the real

Case Studies

The Global Evolution of AI Fact-Checking: Copyright and Research Gaps

Abstract

The propagation and evolution of AI-powered fact-checking tools worldwide has foregrounded the issue of access to quality training data. While many of the most widely adopted systems have originated in the Global North, there are also notable and growing efforts in the Global South to develop and adapt fact-checking technologies. As AI tools develop in new regions, they raise important questions about the differential impact of factors such as copyright on training data access, pointing to persistent obstacles and areas needing further research.

1. Introduction

The spread of AI-powered fact-checking tools marks a significant shift in how misinformation is detected and addressed, transforming practices in journalism and civil society across continents. Systems that began in research institutions and newsrooms in the Global North are now being adapted and implemented in local contexts—including Africa, Latin America, and the Middle East. Although some widely used technologies originated in the North, emerging initiatives in the Global South are beginning to shape the landscape through locally designed and customized tools.

This evolution brings renewed attention to longstanding questions: Which factors shape obstacles or biases within training data? Do factors like resource constraints or copyright laws shape who can access, adapt, and benefit from these AI fact-checking systems? As adoption extends to newer regions, the risk of unequal impact increases—not only due to linguistic and technical barriers, but also because of differences in copyright policies and regulatory environments. When it comes to copyright, there is a need for further research, especially as global partnerships and technology transfer intensify.

2. The Evolution and Impact of North-South AI Fact-Checking Partnerships

Recent years have seen the emergence of global collaborations for AI-powered fact-checking, bridging expertise between the Global North and South. One example began with the development and piloting of AI technologies by Full Fact—a UK-based fact-checking organization—and was then extended through partnerships with organisations such as Africa Check (Africa), Chequeado (Latin America), and the Arab Fact-Checking Network (Middle East). While most early implementation focused on English-language media monitoring and elections in developed countries, philanthropic support and local adaptations have greatly expanded reach and relevance.

Scale and Spread:

3. Concrete Examples of AI Fact-Checking through Partnerships

Nigeria (Africa Check)

During Nigeria's 2023 elections, Africa Check deployed AI-powered claim detection and transcription systems (developed in partnership with Full Fact) to monitor over 40,000 daily "fact-checkable claims" from more than 80 media sources. These claims were algorithmically screened for verification potential; a subset was selected for intensive human review and public debunking. This enabled fact-checkers to respond at unprecedented scale to viral misinformation and political rumors, with documented improvements in election monitoring and rapid rebuttal.

Argentina (Chequeado)

In Latin America, Chequeado integrated the partnership's core AI technologies, adapting them into their "Chequeabot" system for Spanish-language fact-checking. Used during major televised debates and news events, Chequeabot supplied journalists with real-time claim detection and prioritization, flagging misleading or controversial statements for timely investigation and reporting.

Middle East (Arab Fact-Checking Network)

The Arab Fact-Checking Network and affiliated organizations have recently adopted the same AI tools for live tracking and countering misinformation in Arabic-language media. During national elections and crises, these technologies enabled rapid, high-volume monitoring of broadcast and online claims, supporting collaborative efforts to uphold media integrity and provide accurate information under pressure.
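To illustrate the screening step these systems perform (claim detection followed by prioritization for human review), here is a minimal, hypothetical sketch. It is not Full Fact's or Chequeado's code; the cue list and scoring heuristic are simplified stand-ins for the trained models that real systems use.

```python
# Minimal, hypothetical sketch of a claim-screening step: detect sentences
# that look "fact-checkable" and rank them for human review. Real systems
# described in this case study use trained models; this heuristic scorer
# is a simplified stand-in for illustration only.
import re

CHECKWORTHY_CUES = [
    r"\b\d+(\.\d+)?%?\b",                          # numbers and percentages
    r"\b(increase|decrease|double|halve)d?\b",      # change verbs
    r"\b(according to|statistics|figures|budget|unemployment)\b",
    r"\b(promised|claimed|announced|spent)\b",
]

def score_sentence(sentence: str) -> int:
    """Count how many check-worthiness cues a sentence matches."""
    return sum(bool(re.search(p, sentence, re.IGNORECASE)) for p in CHECKWORTHY_CUES)

def screen_transcript(transcript: str, top_k: int = 3):
    """Split a transcript into sentences and return the top-k candidates
    for human fact-checkers, highest score first."""
    sentences = [s.strip() for s in re.split(r"[.!?]\s+", transcript) if s.strip()]
    ranked = sorted(sentences, key=score_sentence, reverse=True)
    return [s for s in ranked[:top_k] if score_sentence(s) > 0]

if __name__ == "__main__":
    demo = ("Unemployment has decreased by 12% since 2021. "
            "Thank you all for coming tonight. "
            "According to official figures, the budget doubled last year.")
    for claim in screen_transcript(demo):
        print("Review:", claim)
```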
4. The Impact of Copyright Restrictions on Training Data Quality

A critical challenge for AI-powered fact-checking is that legal and financial constraints, including copyright restrictions, can limit access to high-quality, rigorously verified journalism. Leading news sources may be behind paywalls or subject to licensing requirements, making them difficult or costly to include in AI training sets. By contrast, low-quality, sensationalised content can be more easily available, promoted by online algorithms. This discrepancy between access to low-quality versus verified information sources partly explains the increased need for AI fact-checking. But it also presents the danger that automated fact-checking tools may be skewed through their training data toward unreliable sources. Without intentional curation and targeted copyright exceptions and/or ethical licensing, automated systems risk amplifying lower-quality information or missing key local knowledge. Projects must prioritize source verification, data provenance transparency, and partnerships with credible media to preserve integrity—while more research is needed on how copyright affects access and outcomes in different regions.

5. Obstacles Facing AI Fact-Checking Tools in the Global South

Where large-scale AI fact-checking tools are initially optimized for English and major European languages, this can lead to gaps in coverage and effectiveness for smaller languages and markets. Both imported and locally developed systems must contend with:

6. The Differential Impact of Copyright on AI Fact-Checking Tools

Access to high-quality training data is essential for developing effective AI fact-checking tools. However, obtaining this data typically requires either formal licenses from content owners or reliance on specific copyright exceptions allowing for legally sanctioned use. Copyright exceptions vary widely across jurisdictions. When it comes to generative AI more broadly, recent court decisions illustrate both the possibilities and the limits of using copyright exceptions to access training data. Notably, the US case of Thomson Reuters v. Ross Intelligence found that commercial use of copyrighted legal content for AI training may not qualify as fair use, whereas Kadrey v. Meta Platforms acknowledged that "highly transformative" uses may be protected. In Europe, the Hamburg Regional Court and the Hungarian Municipal Court have upheld TDM exceptions for AI research under EU law in non-commercial contexts, setting important precedents for lawful data mining and access.

7. Research Agenda: Questions for Future Study

This brief analysis surfaces several areas urgently in need of deeper research and empirical investigation as AI-powered fact-checking spreads globally:

8. Conclusion

The propagation of AI-powered fact-checking across borders—exemplified by the evolution and international spread of major partnership models—demonstrates both promise and persistent complexity.
Language coverage, training data disparities, and legal constraints play significant roles in shaping who benefits from new technologies and whose voices are heard. The differential impact of copyright law is a newly urgent research challenge, likely affecting access, effectiveness, and equity in media verification globally. Ongoing empirical study and

Case Studies, TDM Cases

LATAM-GPT: A Culturally Sensitive Large Language Model for Latin America

LATAM-GPT is a groundbreaking large language model developed by the National Center for Artificial Intelligence (CENIA) in Chile, in partnership with over thirty institutions and twelve Latin American countries. The initiative aims to create an open-source AI model that reflects the region's diverse cultures, languages—including Spanish, Portuguese, and Indigenous tongues—and social realities. Using ethically sourced, regionally contributed data, LATAM-GPT seeks to overcome the limitations and biases of global AI models predominantly trained on English data. The project is designed to empower local communities, preserve linguistic diversity, and support applications in education, public services, and beyond. Its development highlights the importance of ethical data practices, regional collaboration, and policy frameworks that foster inclusive, representative AI. A video version of this case study is available below.

1. Background: Digital Inequality and AI in Latin America

Latin America faces unique challenges in digital inclusion, with significant linguistic, cultural, and infrastructural diversity across the region. Global AI models often fail to capture local nuances, leading to inaccuracies and reinforcing stereotypes. Many Indigenous and minority languages are underrepresented in mainstream technology, exacerbating digital exclusion. LATAM-GPT addresses these gaps by building a model tailored to the region's needs, aiming to democratize access to advanced AI and promote technological sovereignty.

2. Technology and Approach

LATAM-GPT is based on Llama 3, a state-of-the-art large language model architecture. The model is trained on more than 8 terabytes of regionally sourced text, encompassing Spanish, Portuguese, and Indigenous languages such as Rapa Nui. Training is conducted on a distributed network of computers across Latin America, including facilities at the University of Tarapacá in Chile and cloud-based platforms. The open-source nature of the project allows for transparency, adaptability, and broad participation from local developers and researchers.

3. Project Overview

The project is coordinated by CENIA with support from the Chilean government, the regional development bank CAF, Amazon Web Services, and over thirty regional organizations. LATAM-GPT's primary objective is to serve as a foundation for culturally relevant AI applications—such as chatbots, virtual public service assistants, and educational tools—rather than directly competing with global consumer products like ChatGPT. A key focus is the preservation and revitalization of Indigenous languages, with the first translation tools already developed for Rapa Nui and plans to expand to other languages.

4. Data Sources and Key Resources

LATAM-GPT uses ethically sourced data contributed by governments, universities, libraries, archives, and community organizations across Latin America. This includes official documents, public records, literature, historical materials, and Indigenous language texts. All data is carefully curated to ensure privacy, consent, and cultural sensitivity. Unlike many global AI models, LATAM-GPT publishes its list of data sources, emphasizing transparency and ethical data governance.

5. Legal and Ethical Challenges

Copyright and Licensing: The project relies on open-access and properly licensed materials, with explicit permissions from data contributors. This approach avoids the legal uncertainties faced by models that scrape data indiscriminately from the internet.

Data Privacy and Consent: CENIA and its partners ensure that sensitive personal information is anonymized or excluded, and that data collection respects the rights and wishes of contributors, especially Indigenous communities.

Inclusivity and Bias: By prioritizing local languages and cultural contexts, LATAM-GPT aims to reduce biases inherent in global models. Ongoing community engagement and feedback are integral to the model's development and evaluation.
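As an illustration of the curation step described in sections 4 and 5, here is a minimal, hypothetical sketch of a corpus filter that keeps only documents with an explicit open licence and records their provenance for publication. It is not CENIA's pipeline; the field names and accepted licence list are assumptions made for the example.

```python
# Hypothetical sketch: filter a candidate training corpus so that only
# explicitly licensed documents are kept, and record provenance for
# publication alongside the model. Field names and the accepted licence
# list are illustrative assumptions, not LATAM-GPT's actual schema.
import json

ACCEPTED_LICENCES = {"CC-BY-4.0", "CC-BY-SA-4.0", "CC0-1.0", "public-domain"}

def curate(records):
    """Yield (text, provenance) pairs for records with an accepted licence
    and an identifiable contributor."""
    for rec in records:
        licence = rec.get("licence")
        contributor = rec.get("contributor")
        if licence in ACCEPTED_LICENCES and contributor:
            provenance = {
                "source_id": rec.get("id"),
                "contributor": contributor,
                "licence": licence,
                "language": rec.get("language", "unknown"),
            }
            yield rec["text"], provenance

if __name__ == "__main__":
    candidates = [
        {"id": 1, "text": "Texto oficial...", "licence": "CC-BY-4.0",
         "contributor": "national archive", "language": "es"},
        {"id": 2, "text": "Scraped article...", "licence": None,
         "contributor": None, "language": "pt"},
    ]
    kept = list(curate(candidates))
    print(f"kept {len(kept)} of {len(candidates)} documents")
    print(json.dumps([p for _, p in kept], indent=2, ensure_ascii=False))
```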
6. International and Regional Collaboration

LATAM-GPT exemplifies pan-regional cooperation, with twelve countries and over thirty institutions contributing expertise, data, and infrastructure. The project has also engaged international partners and multilateral organizations, such as the Organization of American States and the Inter-American Development Bank, to support its mission of technological empowerment and digital inclusion.

7. Emerging Technology and Policy Issues

LATAM-GPT's open-source model sets a precedent for responsible AI development, emphasizing transparency, ethical data use, and regional self-determination. The project also highlights the need for robust digital infrastructure and continued investment to ensure equitable access to AI across Latin America. As with all large language models, ongoing attention to potential biases, data privacy, and the impact on local labor and education systems is essential.

8. National and Regional Legal Frameworks

While LATAM-GPT's ethical sourcing and licensing practices minimize legal risks, the project underscores the importance of harmonized copyright and data protection laws across Latin America. Policymakers are encouraged to develop frameworks that facilitate data sharing for socially beneficial AI, protect Indigenous knowledge, and promote open science.

9. Contractual or Policy Barriers

Some challenges remain in securing permissions for certain data sources, particularly from private publishers or institutions with restrictive contracts. The project's commitment to open licensing and community engagement helps mitigate these barriers, but continued advocacy is needed to expand access to valuable regional content.

10. Conclusions

LATAM-GPT represents a major step forward in creating culturally sensitive, inclusive AI for Latin America. By centering ethical data practices, regional collaboration, and linguistic diversity, the project offers a model for other regions seeking to decolonize AI and ensure technology serves local needs. Continued investment, policy reform, and community participation will be crucial to realizing the full potential of LATAM-GPT and similar initiatives.

Video Version

Hear from the researchers themselves. Watch the video of this case study below.

Case Studies, TDM Cases

Blind South Africa: Apps for the Visually Impaired

An initiative led by Christo de Klerk at Blind South Africa (BlindSA) focuses on promoting the use of accessible mobile and digital applications for blind and visually impaired people in South Africa. Highlighted at the 2025 Copyright and the Public Interest in Africa conference, the project addresses the importance of robust copyright exceptions and supportive legal frameworks to enable the development and use of apps that work in African languages, read texts aloud, describe images and films, and provide local information. The legal challenges – especially around copyright and licensing of accessible content and the use of visual works as AI training data – point to lessons for responsible, inclusive innovation and policy reform in assistive technology. A video version of this case study is available below.

1. Disability, Inequality, and Digital Exclusion in South Africa

South Africa has a high prevalence of visual impairment and blindness, with over a million people affected. Barriers to education, employment, and public participation are exacerbated by inaccessible information and digital services. While assistive technologies offer new opportunities for inclusion, their effectiveness depends on being designed for accessibility, linguistic diversity, and local context.

2. Assistive Technology and Accessibility

Assistive technology includes devices and software such as screen readers, braille displays, text-to-speech apps, navigation tools, and AI-powered applications that describe images and films. Development is shaped by international standards like the Web Content Accessibility Guidelines (WCAG) and local policy frameworks, ensuring that accessibility is built into digital tools.

3. Project Overview

The BlindSA initiative focuses on raising awareness and supporting the use of existing accessible apps for blind and visually impaired users, particularly those supporting African languages and offering features like scene description and audio narration. Rather than developing new apps, the project identifies, tests, and disseminates information about effective tools. It also advocates for legal and policy changes, including strong copyright exceptions, to empower developers and users in South Africa's multilingual communities. Community participation and user feedback are central to the approach.

4. Data Sources and Key Resources

Apps promoted by the initiative draw on diverse data sources: public domain texts, government information, community-generated content, and web-scraped text and images, including photographs, artworks, and films. Many apps use AI to generate scene descriptions and audio narration, opening up previously inaccessible content. The initiative emphasizes tools that support African languages, providing read-aloud and real-time information (such as weather and news) in Zulu, Sotho, and others. However, reliance on copyrighted materials for AI training and accessible formats raises pressing legal questions about copyright exceptions, licensing, and the need for inclusive legal frameworks.

5. Legal and Ethical Challenges

Copyright and Licensing: Many resources needed for accessible apps—books, newspapers, educational materials, artworks, photographs, and films—are protected by copyright. South Africa's current law offers limited exceptions for accessible formats, making it difficult to legally convert works into braille, audio, large print, or described video without explicit permission. Outdated copyright laws have long denied blind South Africans equal access to information, highlighting the need for robust legal exceptions.

Contractual Restrictions: Even when content is publicly available, licensing terms may prohibit adaptation or redistribution in accessible formats.

Absence of Research and Accessibility Exceptions: Unlike countries that have ratified and implemented the Marrakesh Treaty, South Africa's copyright regime remains restrictive. BlindSA's Constitutional Court case challenged these limitations, emphasizing that the right to read is a fundamental human right and that lack of access excludes blind people from education, employment, and society.
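The format conversion at issue here, turning written text into speech, is technically simple; the barrier is largely legal rather than technical. Below is a minimal sketch using the open-source pyttsx3 text-to-speech library. The sample text is a placeholder, and real accessible-format workflows (braille, described video) are considerably more involved.

```python
# Minimal sketch: convert a short text into synthesized speech using the
# open-source pyttsx3 library. This illustrates the kind of accessible-format
# conversion discussed above; the sample text is a placeholder, and the code
# does not address the licensing questions raised in this case study.
import pyttsx3  # pip install pyttsx3

def read_aloud(text: str, rate: int = 150) -> None:
    """Speak the given text through the system's default TTS voice."""
    engine = pyttsx3.init()           # uses the platform's speech engine
    engine.setProperty("rate", rate)  # words per minute, adjustable by the user
    engine.say(text)
    engine.runAndWait()

if __name__ == "__main__":
    read_aloud("Accessibility is not a luxury or an afterthought. It is a right.")
```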
6. The Marrakesh Treaty and the Struggle for Equality

The Marrakesh Treaty, adopted by WIPO in 2013, requires signatory countries to create copyright exceptions for accessible formats and enables cross-border sharing. It is a milestone in addressing the "book famine" for blind people and affirms access to information as a fundamental right. South Africa signed the Treaty in 2014 but, as of 2025, has not fully implemented it into national law, leaving the blind community at a disadvantage. BlindSA has been at the forefront of advocacy for these reforms, including a Constitutional Court challenge to the country's copyright law.

7. Generative AI, Copyright Backlash, and the Accessibility Exception

The rise of generative AI has sparked backlash from creators concerned about unauthorized use of their works as training data and the risk of AI-generated content substituting for original works. However, experts and advocates for the visually impaired emphasize the importance of copyright exceptions for accessibility. When AI is used for accessible formats—such as describing images or reading text aloud—there is no substitution effect, only an expansion of access and rights. Copyright policy and AI regulation should distinguish between commercial, substitutive uses of AI and socially beneficial uses that enable access for the visually impaired. The Marrakesh Treaty and proposed South African reforms recognize the need to balance creators' interests with the transformative potential of AI for inclusion.

8. National Legal Reform

The Copyright Amendment Bill proposes new exceptions for creating and distributing accessible format copies, as well as a general fair use provision. If enacted, these reforms would significantly improve access to information for the visually impaired in South Africa.

9. Contractual or Policy Barriers

The initiative highlights the need for open licensing and permissions from publishers and content providers. Some government and educational materials remain inaccessible due to restrictive licenses or outdated contracts.

10. Conclusions

Despite legal and policy challenges, BlindSA's work has raised awareness of accessible apps and digital resources, including text-to-speech tools, navigation apps, and AI-powered scene description in South African languages. User participation is emphasized: "Accessibility is not a luxury or an afterthought—it is a right." BlindSA's experience highlights the urgent need for harmonized copyright exceptions, open standards, and inclusive digital policy. Policymakers, funders, and technology developers should prioritize accessibility and the rights of people with disabilities in all digital innovation efforts.

Video Version

Hear from the researchers themselves. Watch the video of this case study below.

Case Studies, TDM Cases

A Talking Health Chatbot in African Languages: DSFSI, University of Pretoria

A project at the Data Sciences for Social Impact (DSFSI) group, University of Pretoria, led by Professor Vukosi Marivate, is developing a talking health chatbot in African languages to provide accessible, culturally relevant health information to underserved communities. Central to the initiative is the planned use of health actuality TV and radio programmes produced by the South African Broadcasting Corporation (SABC) as training data, which introduces legal and ethical considerations around the use of publicly funded broadcast materials in AI. Until South Africa's pending copyright reforms are enacted, the project may be held up, pointing to the need for harmonised legal frameworks for AI for Good in Africa.

1. Health Inequality in South Africa and the Role of AI

South Africa experiences some of the world's highest health inequalities, shaped by historical, economic, and social factors. Urban centers are relatively well-resourced, but rural and peri-urban areas face critical shortages of health professionals and infrastructure. Language and literacy barriers further exacerbate disparities, with many unable to access health information in their mother tongue. Digital health interventions, especially those using artificial intelligence, offer a way to bridge these gaps by delivering accurate, on-demand health information in local languages. An AI-powered chatbot can empower users to make informed decisions, understand symptoms, and navigate the healthcare system, promoting greater equity in health outcomes.

2. What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a subfield of artificial intelligence that enables computers to understand, interpret, and generate human language. NLP powers applications such as chatbots, voice assistants, and automated translation tools, making it crucial for digital inclusion, especially for speakers of underrepresented languages.

3. Project Overview

The DSFSI health chatbot project aims to build an AI-powered conversational agent that delivers reliable health information in multiple African languages. The project's mission is to address health literacy gaps and promote equitable access to vital information, particularly in communities where language and resource barriers persist.

4. Data Sources and Key Resources

A distinctive feature of the project is its intention to use health actuality programmes broadcast by the SABC as primary training data. These programmes offer authentic dialogues in various African languages and cover a wide range of health topics relevant to local communities. However, the use of SABC broadcast material introduces significant legal and ethical complexities. The DSFSI team has spent years negotiating with the SABC to secure permission for use of these programmes as training data, but obtaining a definitive answer has proven elusive, leaving the project in a state of legal uncertainty.

5. Legal and Ethical Challenges

Copyright and Licensing: SABC's health actuality programmes are protected by copyright, with all rights typically reserved by the broadcaster. Using these materials for AI training without explicit permission may constitute copyright infringement, regardless of educational or social impact goals.

Contractual Restrictions: Even if SABC content is publicly accessible, the broadcaster's terms of use or licensing agreements may explicitly prohibit reuse, redistribution, or data mining.

Absence of Research Exceptions: South African copyright law currently lacks robust exceptions for text and data mining (TDM) or research use, unlike the European Union's TDM exceptions or the United States' Fair Use doctrine.

Data Privacy and Community Engagement: If the chatbot is later trained on user interactions or collects personal health information, the project must also comply with the Protection of Personal Information Act (POPIA) and ensure meaningful informed consent from all participants.
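For context, the technical step the project is waiting to perform is straightforward. The sketch below shows how broadcast audio could be transcribed into a text corpus using the open-source Whisper speech-recognition model as a stand-in; it assumes permission has been secured, and the file names and model size are placeholders rather than details of the DSFSI pipeline.

```python
# Hypothetical sketch: transcribe broadcast audio into text for a training
# corpus, using the open-source Whisper ASR model as a stand-in. This assumes
# permission to use the recordings has been secured; file names and the model
# size are placeholders, not details of the DSFSI pipeline.
import json
import whisper  # pip install openai-whisper

def transcribe_episodes(audio_files, model_size="small"):
    """Return a list of {file, language, text} records, one per episode."""
    model = whisper.load_model(model_size)
    corpus = []
    for path in audio_files:
        result = model.transcribe(path)  # detects language, returns text
        corpus.append({
            "file": path,
            "language": result.get("language"),
            "text": result["text"].strip(),
        })
    return corpus

if __name__ == "__main__":
    episodes = ["health_show_episode1.mp3"]  # placeholder file name
    with open("health_corpus.json", "w", encoding="utf-8") as f:
        json.dump(transcribe_episodes(episodes), f, ensure_ascii=False, indent=2)
```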
6. Public Funding and the Public Interest Argument

A significant dimension in negotiations with the SABC is the broadcaster's funding structure. The SABC operates under a government charter and receives substantial public subsidies, with direct grants and bailouts accounting for about 27% of its 2022/2023 revenue. This strengthens the argument that SABC-produced content should be accessible for public interest projects, particularly those addressing urgent challenges like health inequality and language inclusion. Many in the research and innovation community contend that publicly funded content should be available for projects benefiting the broader public, especially those focused on health literacy and digital inclusion.

7. The WIPO Broadcasting Treaty: A New Layer of Complexity

The international copyright landscape is evolving, with the World Intellectual Property Organization (WIPO) currently negotiating a Broadcasting Treaty. Recent drafts propose granting broadcasters—including public entities like the SABC—new, additional exclusive rights over their broadcast content, independent of the underlying copyright. Some drafts suggest these new rights could override or negate existing copyright exceptions and limitations, including those that might otherwise permit uses for research, education, or public interest projects. If adopted in its current form, the WIPO Broadcasting Treaty could further restrict the ability of researchers and innovators to use broadcast material for AI training, even when the content is publicly funded or serves a vital social function.

8. The Copyright Amendment Bill: Introducing Fair Use in South Africa

A potentially transformative development is the Copyright Amendment Bill, which aims to introduce a Fair Use doctrine into South African law. Modeled after the U.S. system, Fair Use would allow limited use of copyrighted material without permission for research, teaching, and public interest innovation—the core activities of the DSFSI health chatbot initiative. If enacted, the Bill would provide a much-needed legal pathway for researchers to use materials like SABC broadcasts for AI training, provided the use is fair, non-commercial, and does not undermine the market for the original work. However, the Bill has faced significant opposition and delays, and is currently under review by the Constitutional Court, leaving its future uncertain.

9. Contractual or Policy Barriers

In the absence of clear research exceptions, the project team must review and potentially negotiate with the SABC to secure permissions or licenses for the intended use of broadcast content. Without such agreements, the project may be forced to exclude valuable data sources or pivot to community-generated content.

10. Cross-Border and Multi-Jurisdictional Issues

If the chatbot expands to use or serve content from other African countries, it will encounter a patchwork of copyright and data protection laws, further complicating compliance and cross-border collaboration.
11. Conclusions

The challenges faced by the DSFSI health chatbot project

Case Studies, TDM Cases

Masakhane: Use of the JW300 Dataset for Natural Language Processing

The Masakhane Project showcases the transformative power of open, collaborative efforts in advancing natural language processing (NLP) for African languages. However, its reliance on the JW300 dataset—a vast multilingual corpus primarily comprising copyrighted biblical translations—uncovered significant legal and ethical challenges. These challenges focused on copyright restrictions, contract overrides, and the complexities of cross-border data use. This led to the discontinuation of JW300's use within Masakhane, prompting a shift toward community-generated data. The experience illustrates the urgent need for robust copyright exceptions, clear legal frameworks, and ethical data sourcing to foster innovation and inclusivity in global NLP research. A video version of the case study is available below.

1. What Is Natural Language Processing?

Natural Language Processing (NLP) is a branch of computer science and artificial intelligence focused on enabling computers to understand, interpret, and generate human language, both written and spoken. NLP integrates computational linguistics with machine learning, deep learning, and statistical modeling, allowing machines to recognize patterns, extract meaning, and respond to natural language inputs in ways that approximate human comprehension. NLP underpins many everyday technologies, including search engines, digital assistants, chatbots, voice-operated GPS systems, and automated translation services. NLP is crucial for breaking down language barriers and has become integral to the digital transformation of societies worldwide.

2. Masakhane Project Overview

The Masakhane Project is an open-source initiative dedicated to advancing NLP for African languages. Its mission is to democratize access to NLP tools by building a continent-wide research community and developing datasets and benchmarks tailored to Africa's linguistic diversity. By engaging researchers, linguists, and technologists across the continent, Masakhane ensures that African languages are not marginalized in the digital age. The project employs advanced sequence-to-sequence models, training them on parallel corpora to enable machine translation and other NLP tasks between African languages. The distributed network of contributors allows Masakhane to address the unique challenges of Africa's linguistic landscape, where many languages lack sufficient digital resources.

A notable achievement is the "Decolonise Science" project, which creates multilingual parallel corpora of African research by translating scientific papers from platforms like AfricArxiv into various African languages. This initiative enhances access to academic knowledge and promotes the use of African languages in scientific discourse, exemplifying Masakhane's commitment to African-centric knowledge production and community benefit.

3. JW300 Dataset and Its Role

The JW300 dataset was pivotal to Masakhane's early work. It offers around 100,000 parallel sentences for each of over 300 African languages, mostly sourced from Jehovah's Witnesses' biblical translations. For many languages, JW300 is one of the only large-scale, aligned text sources available, making it invaluable for training baseline translation models such as English-to-Zulu or English-to-Yoruba. Masakhane utilized automated scripts for downloading and preprocessing JW300, including byte-pair encoding (BPE) to optimize model performance. Community contributions further expanded the dataset's coverage, filling language gaps and improving resource quality. JW300's widespread use enabled rapid progress in building machine translation models for underrepresented African languages.
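For readers unfamiliar with the preprocessing step mentioned above, here is a minimal sketch of training a byte-pair-encoding (BPE) subword model with the open-source SentencePiece library and applying it to a sentence. The corpus file name and vocabulary size are placeholders; this is not Masakhane's actual preprocessing script.

```python
# Minimal sketch of BPE subword preprocessing with SentencePiece, the kind
# of step described above for preparing parallel text before training a
# translation model. File name and vocabulary size are placeholders; this
# is not Masakhane's actual pipeline.
import sentencepiece as spm  # pip install sentencepiece

# Train a BPE model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.zu.txt",   # placeholder: one side of a parallel corpus
    model_prefix="bpe_zu",   # writes bpe_zu.model and bpe_zu.vocab
    vocab_size=4000,
    model_type="bpe",
)

# Load the trained model and segment a sentence into subword units.
sp = spm.SentencePieceProcessor(model_file="bpe_zu.model")
pieces = sp.encode("Sawubona mhlaba", out_type=str)
print(pieces)  # subword segmentation; exact pieces depend on the corpus
```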
JW300’s widespread use enabled rapid progress in building machine translation models for underrepresented African languages. 4. Copyright Infringement Discovery Despite JW300’s open availability on platforms like OPUS, its use was legally problematic. In 2023, a legal audit by the Centre for Intellectual Property and Information Technology (CIPIT) in Nairobi revealed that the Jehovah’s Witnesses’ website explicitly prohibited text and data mining in its copyright notice. This meant Masakhane’s use of JW300 was unauthorized. When Masakhane’s organizers formally requested permission to use the data, their request was denied. This highlighted a fundamental tension between Masakhane’s open research ethos and the proprietary restrictions imposed by the dataset’s owners, forcing the project to reconsider its data strategy. 5. Copyright Exceptions and Limitations: The Role of TDM Exceptions and Fair Use Many jurisdictions provide copyright exceptions and limitations to balance creators’ rights with the needs of researchers and innovators. The European Union’s text and data mining (TDM) exceptions and the United States’ Fair Use doctrine are prominent examples. The EU’s Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790) introduced two mandatory TDM exceptions. The first allows research organizations and cultural heritage institutions to conduct TDM for scientific research, regardless of contractual provisions. The second permits anyone to perform TDM for any purpose, provided the rights holder has not expressly opted out. Recent German case law clarified that an explicit reservation in a website’s terms is sufficient to exclude commercial TDM, but the exception remains robust for research contexts. In the U.S., the Fair Use doctrine allows limited use of copyrighted material without permission for purposes like criticism, comment, teaching, scholarship, or research. Courts increasingly recognize that using copyrighted works to train AI models can qualify as Fair Use, especially when the use is transformative and does not harm the original work’s market. Had Masakhane operated in the EU or U.S., these exceptions might have provided a legal basis for using JW300 for non-commercial research. However, most African countries lack clear TDM provisions or Fair Use recognition, exposing researchers to greater legal uncertainty and risk. The Masakhane experience underscores the need for African nations to adopt or clarify copyright exceptions that support research and digital innovation. 6. Contract Overrides Contract overrides occur when contractual terms—such as website terms of service—impose restrictions beyond those set by statutory copyright law. In JW300’s case, Jehovah’s Witnesses’ website terms explicitly prohibit text and data mining, overriding any potential exceptions or fair use provisions. For Masakhane, this meant that even if their use could be justified under fair use or research exceptions in some jurisdictions, the contractual terms imposed stricter limitations. Only in jurisdictions where statutes prevent contracts from overriding copyright exceptions (such as the EU’s TDM provision for research institutions) could these terms be challenged. This highlights the importance of reviewing all terms of service and data use agreements before using third-party datasets, especially in open, cross-border research projects. 7. 
7. Cross-Border Use

The cross-border nature of datasets like JW300 adds further legal complexity, especially for open research projects with contributors across multiple countries. Masakhane operates in a pan-African context, with team members and users in different nations. Copyright and data use laws vary widely. What is permissible under fair

Artificial Intelligence, Blog, Case Studies, TDM Cases

Promoting AI for Good in the Global South – Highlights

Across Africa and Latin America, researchers are using Artificial Intelligence to solve pressing problems: from addressing health challenges and increasing access to information for underserved communities, to preserving languages and culture. This wave of "AI for Good" in the Global South faces a major difficulty: how to access good quality training data, which is scarce in the region and often subject to copyright restrictions.

The most prominent AI companies are in the Global North and, increasingly, in China. These companies generally operate in jurisdictions with more permissive copyright exceptions, which enable Text and Data Mining (TDM), often the first step in training AI language models. The scale of data extraction and exploitation by a handful of AI mega-corporations has raised two pressing concerns: What about researchers and developers in the Global South, and what about the creators and communities whose data is being used to train the AI models?

Ethical AI: An Opportunity for the Global South?

At a side event at WIPO in Geneva in April 2025, we showcased some models of "ethical AI" aimed at:

This week we released a 15-minute highlights video.

Training data and copyright issues

At the start of the event, we cited two Text and Data Mining projects in Africa which have had difficulty in accessing training data due to copyright. The first was the Masakhane Project in Kenya, which used translations of the Bible to develop Natural Language Processing tools in African languages. The second was the Data Sciences for Social Impact group at the University of Pretoria in South Africa, which wants to develop a health chatbot using broadcast TV shows as the training data.

Data Farming, The NOODL license, Copyright Reform

The following speakers then presented cutting-edge work on how to solve copyright and other legal and ethical challenges facing public interest AI in Africa:

The AI Act in Brazil: Remunerating Creators

Carolina Miranda of the Ministry of Culture in Brazil indicated that her government is focused on passing a new law to ensure that creators in Brazil whose work is used to train AI models are properly remunerated. Ms Miranda described how Big Tech in the Global North fails to properly pay creators in Brazil and elsewhere for the exploitation of their work. She confirmed that discussions of the AI Act are still ongoing and that non-profit scientific research will be exempt from the remuneration provision.

Jamie Love of Knowledge Ecology International suggested that, to avoid the tendency of data providers to build a moat around their datasets, a useful model is the Common European Data Spaces being established by the European Commission.

Four factors to Evaluate AI for Good

At the end of the event we put forward the following four discriminating factors which might be used to evaluate to what extent copyright exceptions and limitations should allow developers and researchers to use training data in their applications:

The panel was convened by the Via Libre Foundation in Argentina and ReCreate South Africa, with support from the Program on Information Justice and Intellectual Property (PIJIP) at American University and from the Arcadia Fund. We are currently researching case studies on Text and Data Mining (TDM) and AI for Good in Africa and the Global South.

Ben Cashdan is an economist and TV producer in Johannesburg and the Executive Director of Black Stripe Foundation. He also co-founded ReCreate South Africa.
