TDM Cases


Ethical Sourcing of African Language Data: Lanafrica and the NOODL Licence

Over 2,000 African languages are spoken by approximately 1.4 billion people on the continent, a linguistic diversity that underpins African democracy, development, and cultural life. As artificial intelligence becomes central to progress in areas like healthcare, agriculture and education, new methods for collecting and sharing African language data are urgently needed—methods that support innovation while protecting community interests. In this case study and its accompanying video, we look at two pioneering initiatives in the field of ethical data sourcing in Africa.

The Mining Mindset: From Extraction to Partnership

Historically, many dataset projects have adopted an extractive approach: gathering language data from communities with little return or recognition for the contributors. Professor Gloria Emezue, Research Lead at Lanafrica, diagnoses this approach: “We saw the whole idea of exploitation of data as something of a mindset. When you’re out to mine something, you’re actually exploiting that thing. We now say, okay, let’s have a new approach, which we call data farming.”

Rather than “mining” communities for linguistic resources, Lanafrica works to “farm” data collaboratively—turning village squares into living laboratories, involving contributors directly, and ensuring that partnerships support further community-driven research. A practical example is NaijaVoices: over 1,800 hours of curated audio data in Nigerian languages, sourced through relationships built on respect and transparency. For Lanafrica, research access is free, but commercial use involves agreements that channel support back to the language communities, creating a virtuous cycle of benefit and ongoing dataset growth.

Who Owns the Data—and Who Reaps the Rewards?

Behind every local dataset is a web of legal and ethical questions about ownership and benefit sharing. Dr Melissa Omino, Director at the Centre for Intellectual Property and Information Technology Law (CIPIT), Strathmore University, notes the paradox that emerges when communities contribute their voices but lose control over how their words are used: “Where do you get African languages from? Your first source would be the communities that speak the African languages, right? So you go to the community, record them, and the minute you make the recording, you have created a copyrighted work, and whoever has made the recording owns the copyright.”

According to Professor Chijoke Okorie of the Data Science Law Lab at the University of Pretoria: “What ends up happening is that the communities that created these datasets end up paying for products that are built on the datasets that they created.” Addressing these inequities requires new licensing models that prioritize context, impact, and fair division of benefits.

NOODL Licence: Reimagining Access and Justice

The Nwolite Obodo Open Data Licence (NOODL) breaks new ground in recognising African realities. It creates a two-tier licensing system: broad, cost-free access for African users, and negotiations or royalties for wealthy, commercial or international users. Communities remain central, not secondary. Professor Chijoke Okorie describes the ethos of NOODL: “Nwolite Obodo is Igbo… for community raising, community development. We’ve grown from a research group at the University of Pretoria to a research network of researchers from both Anglophone and Francophone Africa.”

NOODL’s design puts community benefit at the forefront. As Dr Melissa Omino explains: “If you are from a developing country, let’s say you’re from Brazil, and you want to use the data that is licensed under this regime, then you can use it under a Creative Commons licence. If you are a multi-million dollar tech company that wants to use the licence, then you need to negotiate with the AI developers who collected the data. And the licence also ensures that the community gets a benefit.”

Towards More Equitable Language Data

If African language data is simply extracted and exported, local voices risk exclusion—both from representation and from the benefits of digital innovation. By refocusing efforts on partnership and fair licensing, projects like Lanafrica and NOODL demonstrate that ethical, sustainable language technology is achievable. Their experiences may help guide how other communities and researchers engage with digital resources and cultural heritage—ensuring language data works for those who speak it.


LATAM-GPT: A Culturally Sensitive Large Language Model for Latin America

LATAM-GPT is a groundbreaking large language model developed by the National Center for Artificial Intelligence (CENIA) in Chile, in partnership with over thirty institutions and twelve Latin American countries. The initiative aims to create an open-source AI model that reflects the region’s diverse cultures, languages—including Spanish, Portuguese, and Indigenous tongues—and social realities. Using ethically sourced, regionally contributed data, LATAM-GPT seeks to overcome the limitations and biases of global AI models predominantly trained on English data. The project is designed to empower local communities, preserve linguistic diversity, and support applications in education, public services, and beyond. Its development highlights the importance of ethical data practices, regional collaboration, and policy frameworks that foster inclusive, representative AI. A video version of this case study is available below.

1. Background: Digital Inequality and AI in Latin America

Latin America faces unique challenges in digital inclusion, with significant linguistic, cultural, and infrastructural diversity across the region. Global AI models often fail to capture local nuances, leading to inaccuracies and reinforcing stereotypes. Many Indigenous and minority languages are underrepresented in mainstream technology, exacerbating digital exclusion. LATAM-GPT addresses these gaps by building a model tailored to the region’s needs, aiming to democratize access to advanced AI and promote technological sovereignty.

2. Technology and Approach

LATAM-GPT is based on Llama 3, a state-of-the-art large language model architecture. The model is trained on more than 8 terabytes of regionally sourced text, encompassing Spanish, Portuguese, and Indigenous languages such as Rapa Nui. Training is conducted on a distributed network of computers across Latin America, including facilities at the University of Tarapacá in Chile and cloud-based platforms. The open-source nature of the project allows for transparency, adaptability, and broad participation from local developers and researchers. (An illustrative code sketch of this kind of fine-tuning pipeline appears at the end of this case study.)

3. Project Overview

The project is coordinated by CENIA with support from the Chilean government, the regional development bank CAF, Amazon Web Services, and over thirty regional organizations. LATAM-GPT’s primary objective is to serve as a foundation for culturally relevant AI applications—such as chatbots, virtual public service assistants, and educational tools—rather than directly competing with global consumer products like ChatGPT. A key focus is the preservation and revitalization of Indigenous languages, with the first translation tools already developed for Rapa Nui and plans to expand to other languages.

4. Data Sources and Key Resources

LATAM-GPT uses ethically sourced data contributed by governments, universities, libraries, archives, and community organizations across Latin America. This includes official documents, public records, literature, historical materials, and Indigenous language texts. All data is carefully curated to ensure privacy, consent, and cultural sensitivity. Unlike many global AI models, LATAM-GPT publishes its list of data sources, emphasizing transparency and ethical data governance.

5. Legal and Ethical Challenges

Copyright and Licensing: The project relies on open-access and properly licensed materials, with explicit permissions from data contributors. This approach avoids the legal uncertainties faced by models that scrape data indiscriminately from the internet.

Data Privacy and Consent: CENIA and its partners ensure that sensitive personal information is anonymized or excluded, and that data collection respects the rights and wishes of contributors, especially Indigenous communities.

Inclusivity and Bias: By prioritizing local languages and cultural contexts, LATAM-GPT aims to reduce biases inherent in global models. Ongoing community engagement and feedback are integral to the model’s development and evaluation.

6. International and Regional Collaboration

LATAM-GPT exemplifies pan-regional cooperation, with twelve countries and over thirty institutions contributing expertise, data, and infrastructure. The project has also engaged international partners and multilateral organizations, such as the Organization of American States and the Inter-American Development Bank, to support its mission of technological empowerment and digital inclusion.

7. Emerging Technology and Policy Issues

LATAM-GPT’s open-source model sets a precedent for responsible AI development, emphasizing transparency, ethical data use, and regional self-determination. The project also highlights the need for robust digital infrastructure and continued investment to ensure equitable access to AI across Latin America. As with all large language models, ongoing attention to potential biases, data privacy, and the impact on local labor and education systems is essential.

8. National and Regional Legal Frameworks

While LATAM-GPT’s ethical sourcing and licensing practices minimize legal risks, the project underscores the importance of harmonized copyright and data protection laws across Latin America. Policymakers are encouraged to develop frameworks that facilitate data sharing for socially beneficial AI, protect Indigenous knowledge, and promote open science.

9. Contractual or Policy Barriers

Some challenges remain in securing permissions for certain data sources, particularly from private publishers or institutions with restrictive contracts. The project’s commitment to open licensing and community engagement helps mitigate these barriers, but continued advocacy is needed to expand access to valuable regional content.

10. Conclusions

LATAM-GPT represents a major step forward in creating culturally sensitive, inclusive AI for Latin America. By centering ethical data practices, regional collaboration, and linguistic diversity, the project offers a model for other regions seeking to decolonize AI and ensure technology serves local needs. Continued investment, policy reform, and community participation will be crucial to realizing the full potential of LATAM-GPT and similar initiatives.

Video Version

Hear from the researchers themselves. Watch the video of this case study below.
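To make the approach in section 2 concrete, here is a minimal, hypothetical sketch of fine-tuning an open Llama-style checkpoint on a regional text corpus using the open-source Hugging Face transformers library. The model name and corpus file are placeholder assumptions, not the project’s actual pipeline, data, or infrastructure.

```python
# Illustrative only: fine-tune an open causal language model on regional text.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint; any causal LM works
CORPUS = "regional_corpus.txt"             # hypothetical curated text file

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Read plain text and tokenize it into fixed-length training examples.
dataset = load_dataset("text", data_files={"train": CORPUS})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="latam-gpt-sketch",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=dataset["train"],
    # mlm=False yields standard next-token (causal) language-modelling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

A model at LATAM-GPT’s scale would of course be trained on distributed infrastructure such as the regional network described above; the sketch only shows the shape of the pipeline.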


Blind South Africa: Apps for the Visually Impaired

An initiative led by Christo de Klerk at Blind South Africa focuses on promoting the use of accessible mobile and digital applications for blind and visually impaired people in South Africa. Highlighted at the 2025 Copyright and the Public Interest in Africa conference, the project addresses the importance of robust copyright exceptions and supportive legal frameworks to enable the development and use of apps that work in African languages, read texts aloud, describe images and films, and provide local information. The legal challenges – especially around copyright and licensing of accessible content and the use of visual works as AI training data – point to lessons for responsible, inclusive innovation and policy reform in assistive technology. A video version of this case study is available below.

1. Disability, Inequality, and Digital Exclusion in South Africa

South Africa has a high prevalence of visual impairment and blindness, with over a million people affected. Barriers to education, employment, and public participation are exacerbated by inaccessible information and digital services. While assistive technologies offer new opportunities for inclusion, their effectiveness depends on being designed for accessibility, linguistic diversity, and local context.

2. Assistive Technology and Accessibility

Assistive technology includes devices and software such as screen readers, braille displays, text-to-speech apps, navigation tools, and AI-powered applications that describe images and films. Development is shaped by international standards like the Web Content Accessibility Guidelines (WCAG) and local policy frameworks, ensuring that accessibility is built into digital tools. (A small text-to-speech code sketch appears at the end of this case study.)

3. Project Overview

The BlindSA initiative focuses on raising awareness and supporting the use of existing accessible apps for blind and visually impaired users, particularly those supporting African languages and offering features like scene description and audio narration. Rather than developing new apps, the project identifies, tests, and disseminates information about effective tools. It also advocates for legal and policy changes, including strong copyright exceptions, to empower developers and users in South Africa’s multilingual communities. Community participation and user feedback are central to the approach.

4. Data Sources and Key Resources

Apps promoted by the initiative draw on diverse data sources: public domain texts, government information, community-generated content, and web-scraped text and images, including photographs, artworks, and films. Many apps use AI to generate scene descriptions and audio narration, opening up previously inaccessible content. The initiative emphasizes tools that support African languages, providing read-aloud and real-time information (such as weather and news) in Zulu, Sotho, and others. However, reliance on copyrighted materials for AI training and accessible formats raises pressing legal questions about copyright exceptions, licensing, and the need for inclusive legal frameworks.

5. Legal and Ethical Challenges

Copyright and Licensing: Many resources needed for accessible apps—books, newspapers, educational materials, artworks, photographs, and films—are protected by copyright. South Africa’s current law offers limited exceptions for accessible formats, making it difficult to legally convert works into braille, audio, large print, or described video without explicit permission. Outdated copyright laws have long denied blind South Africans equal access to information, highlighting the need for robust legal exceptions.

Contractual Restrictions: Even when content is publicly available, licensing terms may prohibit adaptation or redistribution in accessible formats.

Absence of Research and Accessibility Exceptions: Unlike countries that have ratified and implemented the Marrakesh Treaty, South Africa’s copyright regime remains restrictive. BlindSA’s Constitutional Court case challenged these limitations, emphasizing that the right to read is a fundamental human right and that lack of access excludes blind people from education, employment, and society.

6. The Marrakesh Treaty and the Struggle for Equality

The Marrakesh Treaty, adopted by WIPO in 2013, requires signatory countries to create copyright exceptions for accessible formats and enables cross-border sharing. It is a milestone in addressing the “book famine” for blind people and affirms access to information as a fundamental right. South Africa signed the Treaty in 2014 but, as of 2025, has not fully implemented it into national law, leaving the blind community at a disadvantage. BlindSA has been at the forefront of advocacy for these reforms, including a Constitutional Court challenge to the country’s copyright law.

7. Generative AI, Copyright Backlash, and the Accessibility Exception

The rise of generative AI has sparked backlash from creators concerned about unauthorized use of their works as training data and the risk of AI-generated content substituting for original works. However, experts and advocates for the visually impaired emphasize the importance of copyright exceptions for accessibility. When AI is used for accessible formats—such as describing images or reading text aloud—there is no substitution effect, only an expansion of access and rights. Copyright policy and AI regulation should distinguish between commercial, substitutive uses of AI and socially beneficial uses that enable access for the visually impaired. The Marrakesh Treaty and proposed South African reforms recognize the need to balance creators’ interests with the transformative potential of AI for inclusion.

8. National Legal Reform

The Copyright Amendment Bill proposes new exceptions for creating and distributing accessible format copies, as well as a general fair use provision. If enacted, these reforms would significantly improve access to information for the visually impaired in South Africa.

9. Contractual or Policy Barriers

The initiative highlights the need for open licensing and permissions from publishers and content providers. Some government and educational materials remain inaccessible due to restrictive licenses or outdated contracts.

10. Conclusions

Despite legal and policy challenges, BlindSA’s work has raised awareness of accessible apps and digital resources, including text-to-speech tools, navigation apps, and AI-powered scene description in South African languages. User participation is emphasized: “Accessibility is not a luxury or an afterthought—it is a right.” BlindSA’s experience highlights the urgent need for harmonized copyright exceptions, open standards, and inclusive digital policy. Policymakers, funders, and technology developers should prioritize accessibility and the rights of people with disabilities in all digital innovation efforts.

Video Version

Hear from the researchers themselves. Watch the video of this case study below.
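As a small technical illustration of the text-to-speech tools described in section 2, the sketch below uses the open-source pyttsx3 library, which drives whatever speech engine is installed on the device. It is a generic example under stated assumptions, not one of the apps promoted by BlindSA.

```python
# Minimal offline text-to-speech sketch with pyttsx3 (illustrative only).
import pyttsx3

engine = pyttsx3.init()          # binds to the platform's installed speech engine
engine.setProperty("rate", 150)  # speaking rate in words per minute

# List available voices: whether an African language can be spoken depends
# entirely on the voices installed on the device, which is itself part of
# the accessibility gap this case study highlights.
for voice in engine.getProperty("voices"):
    print(voice.id, voice.languages)

engine.say("Welcome. Sawubona. Dumela.")  # sample greetings (English, Zulu, Sotho)
engine.runAndWait()                       # blocks until speech has finished
```

Screen readers and read-aloud apps layer navigation, document parsing, and braille output on top of this basic speak-text capability.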


A Talking Health Chatbot in African Languages: DSFSI, University of Pretoria

A project at the Data Sciences for Social Impact (DSFSI) group, University of Pretoria, led by Professor Vukosi Marivate, is developing a talking health chatbot in African languages to provide accessible, culturally relevant health information to underserved communities. Central to the initiative is the planned use of health actuality TV and radio programmes produced by the South African Broadcasting Corporation (SABC) as training data, which introduces legal and ethical considerations around the use of publicly funded broadcast materials in AI. With South Africa’s copyright reforms still pending, the project may be held up, pointing to the need for harmonised legal frameworks for AI for Good in Africa.

1. Health Inequality in South Africa and the Role of AI

South Africa experiences some of the world’s highest health inequalities, shaped by historical, economic, and social factors. Urban centers are relatively well-resourced, but rural and peri-urban areas face critical shortages of health professionals and infrastructure. Language and literacy barriers further exacerbate disparities, with many unable to access health information in their mother tongue. Digital health interventions, especially those using artificial intelligence, offer a way to bridge these gaps by delivering accurate, on-demand health information in local languages. An AI-powered chatbot can empower users to make informed decisions, understand symptoms, and navigate the healthcare system, promoting greater equity in health outcomes.

2. What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a subfield of artificial intelligence that enables computers to understand, interpret, and generate human language. NLP powers applications such as chatbots, voice assistants, and automated translation tools, making it crucial for digital inclusion, especially for speakers of underrepresented languages. (A toy chatbot code sketch appears at the end of this case study.)

3. Project Overview

The DSFSI health chatbot project aims to build an AI-powered conversational agent that delivers reliable health information in multiple African languages. The project’s mission is to address health literacy gaps and promote equitable access to vital information, particularly in communities where language and resource barriers persist.

4. Data Sources and Key Resources

A distinctive feature of the project is its intention to use health actuality programmes broadcast by the SABC as primary training data. These programmes offer authentic dialogues in various African languages and cover a wide range of health topics relevant to local communities. However, the use of SABC broadcast material introduces significant legal and ethical complexities. The DSFSI team has spent years negotiating with the SABC to secure permission to use these programmes as training data, but obtaining a definitive answer has proven elusive, leaving the project in a state of legal uncertainty.

5. Legal and Ethical Challenges

Copyright and Licensing: SABC’s health actuality programmes are protected by copyright, with all rights typically reserved by the broadcaster. Using these materials for AI training without explicit permission may constitute copyright infringement, regardless of educational or social impact goals.

Contractual Restrictions: Even if SABC content is publicly accessible, the broadcaster’s terms of use or licensing agreements may explicitly prohibit reuse, redistribution, or data mining.

Absence of Research Exceptions: South African copyright law currently lacks robust exceptions for text and data mining (TDM) or research use, unlike the European Union’s TDM exceptions or the United States’ Fair Use doctrine.

Data Privacy and Community Engagement: If the chatbot is later trained on user interactions or collects personal health information, the project must also comply with the Protection of Personal Information Act (POPIA) and ensure meaningful informed consent from all participants.

6. Public Funding and the Public Interest Argument

A significant dimension in negotiations with the SABC is the broadcaster’s funding structure. The SABC operates under a government charter and receives substantial public subsidies, with direct grants and bailouts accounting for about 27% of its 2022/2023 revenue. This strengthens the argument that SABC-produced content should be accessible for public interest projects, particularly those addressing urgent challenges like health inequality and language inclusion. Many in the research and innovation community contend that publicly funded content should be available for projects benefiting the broader public, especially those focused on health literacy and digital inclusion.

7. The WIPO Broadcasting Treaty: A New Layer of Complexity

The international copyright landscape is evolving, with the World Intellectual Property Organization (WIPO) currently negotiating a Broadcasting Treaty. Recent drafts propose granting broadcasters—including public entities like the SABC—new, additional exclusive rights over their broadcast content, independent of the underlying copyright. Some drafts suggest these new rights could override or negate existing copyright exceptions and limitations, including those that might otherwise permit uses for research, education, or public interest projects. If adopted in its current form, the WIPO Broadcasting Treaty could further restrict the ability of researchers and innovators to use broadcast material for AI training, even when the content is publicly funded or serves a vital social function.

8. The Copyright Amendment Bill: Introducing Fair Use in South Africa

A potentially transformative development is the Copyright Amendment Bill, which aims to introduce a Fair Use doctrine into South African law. Modeled after the U.S. system, Fair Use would allow limited use of copyrighted material without permission for research, teaching, and public interest innovation—the core activities of the DSFSI health chatbot initiative. If enacted, the Bill would provide a much-needed legal pathway for researchers to use materials like SABC broadcasts for AI training, provided the use is fair, non-commercial, and does not undermine the market for the original work. However, the Bill has faced significant opposition and delays, and is currently under review by the Constitutional Court, leaving its future uncertain.

9. Contractual or Policy Barriers

In the absence of clear research exceptions, the project team must review and potentially negotiate with the SABC to secure permissions or licenses for the intended use of broadcast content. Without such agreements, the project may be forced to exclude valuable data sources or pivot to community-generated content.

10. Cross-Border and Multi-Jurisdictional Issues

If the chatbot expands to use or serve content from other African countries, it will encounter a patchwork of copyright and data protection laws, further complicating compliance and cross-border collaboration.

11. Conclusions

The challenges faced by the DSFSI health chatbot project show how legal uncertainty around publicly funded broadcast material can stall socially beneficial AI, underscoring the need for clear research exceptions and harmonised legal frameworks for AI for Good in Africa.
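To make the chatbot concept in section 2 concrete, here is a minimal, hypothetical retrieval-based question matcher using scikit-learn. The FAQ entries are invented placeholders; the actual DSFSI system is a far richer multilingual conversational agent.

```python
# Toy retrieval-based health FAQ matcher (illustrative, not the DSFSI system).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical FAQ pairs; a real system would hold curated, vetted content
# in several African languages.
faq = {
    "What are the symptoms of flu?": "Common flu symptoms include fever, cough and fatigue.",
    "How can I find my nearest clinic?": "Use the clinic locator or call the health line.",
}
questions = list(faq)

vectorizer = TfidfVectorizer()
question_matrix = vectorizer.fit_transform(questions)

def answer(user_query: str) -> str:
    """Return the answer whose stored question is most similar to the query."""
    similarities = cosine_similarity(
        vectorizer.transform([user_query]), question_matrix
    )[0]
    best = similarities.argmax()
    if similarities[best] < 0.2:  # arbitrary confidence floor
        return "Sorry, I do not know. Please contact a health worker."
    return faq[questions[best]]

print(answer("flu symptoms?"))
```

A retrieval approach like this returns only vetted answers, which matters in a health setting where freely generated text could be unsafe.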


Masakhane: Use of the JW300 Dataset for Natural Language Processing

The Masakhane Project showcases the transformative power of open, collaborative efforts in advancing natural language processing (NLP) for African languages. However, its reliance on the JW300 dataset—a vast multilingual corpus primarily comprising copyrighted biblical translations—uncovered significant legal and ethical challenges, centred on copyright restrictions, contract overrides, and the complexities of cross-border data use. These challenges led to the discontinuation of JW300’s use within Masakhane, prompting a shift toward community-generated data. The experience illustrates the urgent need for robust copyright exceptions, clear legal frameworks, and ethical data sourcing to foster innovation and inclusivity in global NLP research. A video version of the case study is available below.

1. What Is Natural Language Processing?

Natural Language Processing (NLP) is a branch of computer science and artificial intelligence focused on enabling computers to understand, interpret, and generate human language, both written and spoken. NLP integrates computational linguistics with machine learning, deep learning, and statistical modeling, allowing machines to recognize patterns, extract meaning, and respond to natural language inputs in ways that approximate human comprehension. NLP underpins many everyday technologies, including search engines, digital assistants, chatbots, voice-operated GPS systems, and automated translation services. NLP is crucial for breaking down language barriers and has become integral to the digital transformation of societies worldwide.

2. Masakhane Project Overview

The Masakhane Project is an open-source initiative dedicated to advancing NLP for African languages. Its mission is to democratize access to NLP tools by building a continent-wide research community and developing datasets and benchmarks tailored to Africa’s linguistic diversity. By engaging researchers, linguists, and technologists across the continent, Masakhane ensures that African languages are not marginalized in the digital age. The project employs advanced sequence-to-sequence models, training them on parallel corpora to enable machine translation and other NLP tasks between African languages. The distributed network of contributors allows Masakhane to address the unique challenges of Africa’s linguistic landscape, where many languages lack sufficient digital resources. A notable achievement is the “Decolonise Science” project, which creates multilingual parallel corpora of African research by translating scientific papers from platforms like AfricArxiv into various African languages. This initiative enhances access to academic knowledge and promotes the use of African languages in scientific discourse, exemplifying Masakhane’s commitment to African-centric knowledge production and community benefit.

3. JW300 Dataset and Its Role

The JW300 dataset was pivotal to Masakhane’s early work. It offers around 100,000 parallel sentences per language pair across more than 300 languages, many of them African, mostly sourced from Jehovah’s Witnesses’ biblical translations. For many languages, JW300 is one of the only large-scale, aligned text sources available, making it invaluable for training baseline translation models such as English-to-Zulu or English-to-Yoruba. Masakhane utilized automated scripts for downloading and preprocessing JW300, including byte-pair encoding (BPE) to optimize model performance. (A short BPE code sketch appears at the end of this case study.) Community contributions further expanded the dataset’s coverage, filling language gaps and improving resource quality. JW300’s widespread use enabled rapid progress in building machine translation models for underrepresented African languages.

4. Copyright Infringement Discovery

Despite JW300’s open availability on platforms like OPUS, its use was legally problematic. In 2023, a legal audit by the Centre for Intellectual Property and Information Technology Law (CIPIT) in Nairobi revealed that the Jehovah’s Witnesses’ website explicitly prohibited text and data mining in its copyright notice. This meant Masakhane’s use of JW300 was unauthorized. When Masakhane’s organizers formally requested permission to use the data, their request was denied. This highlighted a fundamental tension between Masakhane’s open research ethos and the proprietary restrictions imposed by the dataset’s owners, forcing the project to reconsider its data strategy.

5. Copyright Exceptions and Limitations: The Role of TDM Exceptions and Fair Use

Many jurisdictions provide copyright exceptions and limitations to balance creators’ rights with the needs of researchers and innovators. The European Union’s text and data mining (TDM) exceptions and the United States’ Fair Use doctrine are prominent examples. The EU’s Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790) introduced two mandatory TDM exceptions. The first allows research organizations and cultural heritage institutions to conduct TDM for scientific research, regardless of contractual provisions. The second permits anyone to perform TDM for any purpose, provided the rights holder has not expressly opted out. Recent German case law clarified that an explicit reservation in a website’s terms is sufficient to exclude commercial TDM, but the exception remains robust for research contexts. In the U.S., the Fair Use doctrine allows limited use of copyrighted material without permission for purposes like criticism, comment, teaching, scholarship, or research. Courts increasingly recognize that using copyrighted works to train AI models can qualify as Fair Use, especially when the use is transformative and does not harm the original work’s market. Had Masakhane operated in the EU or U.S., these exceptions might have provided a legal basis for using JW300 for non-commercial research. However, most African countries lack clear TDM provisions or Fair Use recognition, exposing researchers to greater legal uncertainty and risk. The Masakhane experience underscores the need for African nations to adopt or clarify copyright exceptions that support research and digital innovation.

6. Contract Overrides

Contract overrides occur when contractual terms—such as website terms of service—impose restrictions beyond those set by statutory copyright law. In JW300’s case, Jehovah’s Witnesses’ website terms explicitly prohibit text and data mining, overriding any potential exceptions or fair use provisions. For Masakhane, this meant that even if their use could be justified under fair use or research exceptions in some jurisdictions, the contractual terms imposed stricter limitations. Only in jurisdictions where statutes prevent contracts from overriding copyright exceptions (such as the EU’s TDM provision for research institutions) could these terms be challenged. This highlights the importance of reviewing all terms of service and data use agreements before using third-party datasets, especially in open, cross-border research projects.

7. Cross-Border Use

The cross-border nature of datasets like JW300 adds further legal complexity, especially for open research projects with contributors across multiple countries. Masakhane operates in a pan-African context, with team members and users in different nations, and copyright and data use laws vary widely. What is permissible under fair use or a research exception in one jurisdiction may be infringing in another.
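Section 3 mentions byte-pair encoding (BPE) preprocessing. The following sketch shows the general technique with the open-source sentencepiece library; the corpus file, vocabulary size, and language are placeholder assumptions, and Masakhane’s actual pipeline used its own scripts.

```python
# Illustrative BPE subword training with sentencepiece (not Masakhane's exact scripts).
import sentencepiece as spm

# Train a BPE model on a hypothetical monolingual corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="zu_corpus.txt",   # placeholder corpus file
    model_prefix="zu_bpe",
    vocab_size=4000,         # small vocabularies suit low-resource corpora
    model_type="bpe",
)

# Segment text into subword units before feeding it to a translation model.
sp = spm.SentencePieceProcessor(model_file="zu_bpe.model")
print(sp.encode("Ngiyabonga kakhulu", out_type=str))
# Rare words are split into smaller pieces, so a model can represent
# vocabulary it never saw whole during training.
```

Splitting words into subwords is what lets a translation model trained on a modest corpus like JW300 cope with the rich morphology of many African languages.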


Promoting AI for Good in the Global South – Highlights

Across Africa and Latin America, researchers are using Artificial Intelligence to solve pressing problems: from addressing health challenges and increasing access to information for underserved communities, to preserving languages and culture. This wave of “AI for Good” in the Global South faces a major difficulty: how to access good quality training data, which is scarce in the region and often subject to copyright restrictions. The most prominent AI companies are in the Global North and, increasingly, in China. These companies generally operate in jurisdictions with more permissive copyright exceptions, which enable Text and Data Mining (TDM), often the first step in training AI language models. The scale of data extraction and exploitation by a handful of AI mega-corporations has raised two pressing concerns: what about researchers and developers in the Global South, and what about the creators and communities whose data is being used to train the AI models?

Ethical AI: An Opportunity for the Global South?

At a side event at WIPO in Geneva in April 2025, we showcased some models of “ethical AI”. This week we released a 15-minute highlights video.

Training data and copyright issues

At the start of the event, we cited two Text and Data Mining projects in Africa which have had difficulty in accessing training data due to copyright. The first was the Masakhane Project, which used translations of the Bible to develop Natural Language Processing tools in African languages. The second was the Data Sciences for Social Impact group at the University of Pretoria in South Africa, which wants to develop a health chatbot using broadcast TV shows as the training data.

Data Farming, the NOODL Licence, Copyright Reform

The speakers then presented cutting-edge work on how to solve copyright and other legal and ethical challenges facing public interest AI in Africa.

The AI Act in Brazil: Remunerating Creators

Carolina Miranda of the Ministry of Culture in Brazil indicated that her government is focused on passing a new law to ensure that those creators in Brazil whose work is used to train AI models are properly remunerated. Ms Miranda described how Big Tech in the Global North fails to properly pay creators in Brazil and elsewhere for the exploitation of their work. She confirmed that discussions of the AI Act are still ongoing and that non-profit scientific research will be exempt from the remuneration provision. Jamie Love of Knowledge Ecology International suggested that, to avoid the tendency of data providers to build a moat around their datasets, a useful model is the Common European Data Spaces being established by the European Commission.

Four Factors to Evaluate AI for Good

At the end of the event we put forward four discriminating factors which might be used to evaluate to what extent copyright exceptions and limitations should allow developers and researchers to use training data in their applications.

The panel was convened by the Via Libre Foundation in Argentina and ReCreate South Africa, with support from the Program on Information Justice and Intellectual Property (PIJIP) at American University and from the Arcadia Fund. We are currently researching case studies on Text and Data Mining (TDM) and AI for Good in Africa and the Global South.

Ben Cashdan is an economist and TV producer in Johannesburg and the Executive Director of Black Stripe Foundation. He also co-founded ReCreate South Africa.
