Author name: Ben Cashdan

Centre News

25 Sept 2025: Beyond Adoption: Why it Matters and What is Next for Intellectual Property, Genetic Resources and Traditional Knowledge

On 25 September, Wend Wendland, former Director of the Traditional Knowledge Division at the World Intellectual Property Organization (WIPO), will deliver a lecture on the landmark WIPO Treaty on Intellectual Property, Genetic Resources and Traditional Knowledge, adopted in May 2024. He will address the treaty’s significance for policy making and knowledge governance. The talk is part of the Peter A. Jaszi Distinguished Lecture on Intellectual Property series, hosted by the Program on Information Justice and Intellectual Property (PIJIP) at American University. It also serves as the inaugural event for the newly launched Geneva Centre on Knowledge Governance (see announcement below, in PDF).

Centre News

Tracking AI for Good in the Global South: Our TDM Case Studies are Online

The Geneva Centre on Knowledge Governance has been researching cases of Computational Research (or Text and Data Mining) aimed at public interest outcomes in Africa, Latin America and elsewhere. From a health chatbot in South Africa to a culturally sensitive LLM for Chile and Latin America, we document the work of AI developers, especially in the Global South. This is part of our work to analyse whether and how copyright and AI policies should contain provisions that consider the social impact of AI. See our Case Studies on Text and Data Mining.

Centre News

How Will Gen-AI Lawsuits Impact Copyright? We Help You Keep Track.

Are you trying to keep track of all the litigation by rights holders and creators against Generative AI companies? Litigation is under way covering a range of works, from musical compositions to books and journalism. The arguments presented and judgements handed down in some of these cases may begin to define the likely policy direction for copyright in the age of AI. In the first of a series of articles and papers on this topic, Geneva Centre counsel Andrés Izquierdo wrote an Infojustice blog entitled AI, Copyright, and the Future of Creativity: Notes from the Panama International Book Fair. Watch this space for more.

Centre News

It is Official: The Centre will Launch between September and November 2025

It is official – the Centre on Knowledge Governance will be launching in the coming months. In September and October 2025 we expect to publish some of our new research, including our case studies of Computational Research and AI for Good in the Global South. In November and December we will be present at the upcoming WIPO meetings, including the CDIP and SCCR47. Watch this space for announcements and for special events in Geneva and online. See our calendar for all upcoming events: https://knowledgegov.org/events/

Case Studies, TDM Cases

Ethical Sourcing of African Language Data: Lanafrica and the NOODL licence

Over 2,000 African languages are spoken by approximately 1.4 billion people on the continent, a linguistic diversity that underpins African democracy, development, and cultural life. As artificial intelligence becomes central to progress in areas like healthcare, agriculture and education, new methods for collecting and sharing African language data are urgently needed—methods that support innovation while protecting community interests. In this video we look at two pioneering initiatives in the field of ethical data sourcing in Africa.

The Mining Mindset: From Extraction to Partnership

Historically, many dataset projects have adopted an extractive approach: gathering language data from communities with little return or recognition for the contributors. Professor Gloria Emezue, Research Lead at Lanafrica, diagnoses this approach: “We saw the whole idea of exploitation of data as something of a mindset. When you’re out to mine something, you’re actually exploiting that thing. We now say, okay, let’s have a new approach, which we call data farming.”

Rather than “mining” communities for linguistic resources, Lanafrica works to “farm” data collaboratively—turning village squares into living laboratories, involving contributors directly, and ensuring that partnerships support further community-driven research. A practical example is NaijaVoices: over 1,800 hours of curated audio data in Nigerian languages, sourced through relationships built on respect and transparency. For Lanafrica, research access is free, but commercial use involves agreements that channel support back to the language communities, creating a virtuous cycle of benefit and ongoing dataset growth.

Who Owns the Data—and Who Reaps the Rewards?

Behind every local dataset is a web of legal and ethical questions about ownership and benefit sharing. Dr Melissa Omino, Director at the Centre for Intellectual Property and Information Technology Law (CIPIT), Strathmore University, notes the paradox that emerges when communities contribute their voices but lose control over how their words are used: “Where do you get African languages from? Your first source would be the communities that speak the African languages, right? So you go to the community, record them, the minute you make the recording, then you have created a copyrighted work, and whoever has made the recording owns the copyright.”

According to Professor Chijoke Okorie of the Data Science Law Lab at the University of Pretoria, “What ends up happening is that the communities that created these datasets end up paying for products that are built on the datasets that they created.” Addressing these inequities requires new licensing models that prioritize context, impact, and fair division of benefits.

NOODL Licence: Reimagining Access and Justice

The Nwolite Obodo Open Data Licence (NOODL) breaks new ground in recognising African realities. It creates a two-tier licensing system: broad, cost-free access for African users, and negotiations or royalties for wealthy, commercial or international users. Communities remain central, not secondary. Professor Chijoke Okorie describes the ethos of NOODL: “Nwolite Obodo is Igbo… for community raising, community development. We’ve grown from a research group at the University of Pretoria to a research network of researchers from both Anglophone and Francophone Africa.” NOODL’s design puts community benefit at the forefront.
As Dr Melissa Omino explains: “If you are from a developing country, let’s say you’re from Brazil, and you want to use the data that is licenced under this regime, then you can use it under Creative Commons licence. If you are a multi-million dollar tech company that wants to use the licence, then you need to negotiate with the AI developers who collected the data. And the licence also ensures that the community gets a benefit.”

Towards More Equitable Language Data

If African language data is simply extracted and exported, local voices risk exclusion—both from representation and from the benefits of digital innovation. By refocusing efforts on partnership and fair licensing, projects like Lanafrica and NOODL demonstrate that ethical, sustainable language technology is achievable. Their experiences may help guide how other communities and researchers engage with digital resources and cultural heritage—ensuring language data works for those who speak it.
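To make the two-tier structure concrete, the decision logic described above could be sketched roughly as follows. This is purely illustrative pseudologic, not part of the NOODL licence text; the categories, revenue threshold and function names are assumptions made for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class DataUser:
    based_in_africa: bool      # assumption: tier keyed to user location
    commercial_use: bool       # assumption: and to intended use
    annual_revenue_usd: float  # assumption: rough proxy for "wealthy" users

def noodl_tier(user: DataUser) -> str:
    """Illustrative sketch of the two-tier logic described in the case study.

    Tier 1: broad, cost-free access for African and non-commercial users.
    Tier 2: negotiated terms or royalties for wealthy, commercial or
            international users, with benefits flowing back to the
            language communities.
    """
    if user.based_in_africa and not user.commercial_use:
        return "Tier 1: open access under Creative Commons-style terms"
    if user.commercial_use and user.annual_revenue_usd > 1_000_000:  # hypothetical threshold
        return "Tier 2: negotiate licence and community benefit-sharing"
    return "Tier 2: contact the data stewards to agree terms"

# Example: a large non-African tech company seeking commercial use
print(noodl_tier(DataUser(based_in_africa=False,
                          commercial_use=True,
                          annual_revenue_usd=5e9)))
```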

Case Studies

The Global Evolution of AI Fact-Checking: Copyright and Research Gaps

Abstract

The propagation and evolution of AI-powered fact-checking tools worldwide have foregrounded the issue of access to quality training data. While many of the most widely adopted systems have originated in the Global North, there are also notable and growing efforts in the Global South to develop and adapt fact-checking technologies. As AI tools develop in new regions, they raise important questions about the differential impact of factors such as copyright on training data access, pointing to persistent obstacles and areas needing further research.

1. Introduction

The spread of AI-powered fact-checking tools marks a significant shift in how misinformation is detected and addressed, transforming practices in journalism and civil society across continents. Systems that began in research institutions and newsrooms in the Global North are now being adapted and implemented in local contexts—including Africa, Latin America, and the Middle East. Although some widely used technologies originated in the North, emerging initiatives in the Global South are beginning to shape the landscape through locally designed and customized tools.

This evolution brings renewed attention to longstanding questions: Which factors shape obstacles or biases within training data? Do factors like resource constraints or copyright laws shape who can access, adapt, and benefit from these AI fact-checking systems? As adoption extends to newer regions, the risk of unequal impact increases—not only due to linguistic and technical barriers, but also because of differences in copyright policies and regulatory environments. When it comes to copyright, there is a need for further research, especially as global partnerships and technology transfer intensify.

2. The Evolution and Impact of North-South AI Fact-Checking Partnerships

Recent years have seen the emergence of global collaborations for AI-powered fact-checking, bridging expertise between the Global North and South. One example began with the development and piloting of AI technologies by Full Fact—a UK-based fact-checking organization—and was then extended through partnerships with organisations such as Africa Check (Africa), Chequeado (Latin America), and the Arab Fact-Checking Network (Middle East). While most early implementation focused on English-language media monitoring and elections in developed countries, philanthropic support and local adaptations have greatly expanded reach and relevance.

Scale and Spread:

3. Concrete Examples of AI Fact-Checking through Partnerships

Nigeria (Africa Check)

During Nigeria’s 2023 elections, Africa Check deployed AI-powered claim detection and transcription systems (developed in partnership with Full Fact) to monitor over 40,000 daily “fact-checkable claims” from more than 80 media sources. These claims were algorithmically screened for verification potential; a subset was selected for intensive human review and public debunking. This enabled fact-checkers to respond at unprecedented scale to viral misinformation and political rumors, with documented improvements in election monitoring and rapid rebuttal.

Argentina (Chequeado)

In Latin America, Chequeado integrated the partnership’s core AI technologies—adapting them into its “Chequeabot” system for Spanish-language fact-checking. Used during major televised debates and news events, Chequeabot supplied journalists with real-time claim detection and prioritization, flagging misleading or controversial statements for timely investigation and reporting.
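As an illustration of the claim-screening step these deployments rely on (filtering a large stream of statements down to a checkable subset for human reviewers), a minimal sketch might look like the following. The zero-shot model, candidate labels and routing rule are assumptions made for illustration; this is not the actual Full Fact or partner pipeline.

```python
from transformers import pipeline

# Hypothetical claim-screening step: score statements for "checkability"
# before routing a subset to human fact-checkers.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

statements = [
    "Unemployment fell by 40% in the last two years.",
    "I believe our opponents have no vision for the country.",
]

labels = ["verifiable factual claim", "opinion or rhetoric"]

for text in statements:
    result = classifier(text, candidate_labels=labels)
    # result["labels"] is sorted by score, highest first
    if result["labels"][0] == "verifiable factual claim":
        print(f"QUEUE FOR HUMAN REVIEW: {text}")
    else:
        print(f"SKIP: {text}")
```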
Middle East (Arab Fact-Checking Network)

The Arab Fact-Checking Network and affiliated organizations have recently adopted the same AI tools for live tracking and countering misinformation in Arabic-language media. During national elections and crises, these technologies enabled rapid, high-volume monitoring of broadcast and online claims, supporting collaborative efforts to uphold media integrity and provide accurate information under pressure.

4. The Impact of Copyright Restrictions on Training Data Quality

A critical challenge for AI-powered fact-checking is that legal and financial constraints, including copyright restrictions, can limit access to high-quality, rigorously verified journalism. Leading news sources may be behind paywalls or subject to licensing requirements, making them difficult or costly to include in AI training sets. By contrast, low-quality, sensationalised content can be more easily available, promoted by online algorithms. This discrepancy between access to low-quality versus verified information sources partly explains the increased need for AI fact-checking. But it also presents the danger that automated fact-checking tools may be skewed through their training data toward unreliable sources. Without intentional curation and targeted copyright exceptions and/or ethical licensing, automated systems risk amplifying lower-quality information or missing key local knowledge. Projects must prioritize source verification, data provenance transparency, and partnerships with credible media to preserve integrity—while more research is needed on how copyright affects access and outcomes in different regions.

5. Obstacles Facing AI Fact-Checking Tools in the Global South

Where large-scale AI fact-checking tools are initially optimized for English and major European languages, this can lead to gaps in coverage and effectiveness for smaller languages and markets. Both imported and locally developed systems must contend with:

6. The Differential Impact of Copyright on AI Fact-Checking Tools

Access to high-quality training data is essential for developing effective AI fact-checking tools. However, obtaining this data typically requires either formal licenses from content owners or reliance on specific copyright exceptions allowing for legally sanctioned use. Copyright exceptions vary widely across jurisdictions. When it comes to Generative AI more broadly, recent court decisions illustrate both possibilities and limits in the use of copyright exceptions to access training data. Notably, the US case of Thomson Reuters v. Ross Intelligence found that commercial use of copyrighted legal content for AI training may not qualify as fair use, whereas Kadrey v. Meta Platforms acknowledged that “highly transformative” uses may be protected. In Europe, the Hamburg Regional Court and the Hungarian Municipal Court have upheld TDM exceptions for AI research under EU law in non-commercial contexts, setting important precedents for lawful data mining and access.

7. Research Agenda: Questions for Future Study

This brief analysis surfaces several areas urgently in need of deeper research and empirical investigation as AI-powered fact-checking spreads globally:

8. Conclusion

The propagation of AI-powered fact-checking across borders—exemplified by the evolution and international spread of major partnership models—demonstrates both promise and persistent complexity. Language coverage, training data disparities, and legal constraints play significant roles in shaping who benefits from new technologies and whose voices are heard. The differential impact of copyright law is a newly urgent research challenge, likely affecting access, effectiveness, and equity in media verification globally. Ongoing empirical study and

Case Studies, TDM Cases

A Talking Health Chatbot in African Languages: DSFSI, University of Pretoria

A project at the Data Sciences for Social Impact (DSFSI) group, University of Pretoria, led by Professor Vukosi Marivate, is developing a talking health chatbot in African languages to provide accessible, culturally relevant health information to underserved communities. Central to the initiative is the planned use of health actuality TV and radio programmes produced by the South African Broadcasting Corporation (SABC) as training data, which introduces legal and ethical considerations around the use of publicly funded broadcast materials in AI. With South Africa’s copyright reforms still pending, the project may be held up, pointing to the need for harmonised legal frameworks for AI for Good in Africa.

1. Health Inequality in South Africa and the Role of AI

South Africa experiences some of the world’s highest health inequalities, shaped by historical, economic, and social factors. Urban centers are relatively well-resourced, but rural and peri-urban areas face critical shortages of health professionals and infrastructure. Language and literacy barriers further exacerbate disparities, with many unable to access health information in their mother tongue. Digital health interventions, especially those using artificial intelligence, offer a way to bridge these gaps by delivering accurate, on-demand health information in local languages. An AI-powered chatbot can empower users to make informed decisions, understand symptoms, and navigate the healthcare system, promoting greater equity in health outcomes.

2. What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a subfield of artificial intelligence that enables computers to understand, interpret, and generate human language. NLP powers applications such as chatbots, voice assistants, and automated translation tools, making it crucial for digital inclusion, especially for speakers of underrepresented languages.

3. Project Overview

The DSFSI health chatbot project aims to build an AI-powered conversational agent that delivers reliable health information in multiple African languages. The project’s mission is to address health literacy gaps and promote equitable access to vital information, particularly in communities where language and resource barriers persist.

4. Data Sources and Key Resources

A distinctive feature of the project is its intention to use health actuality programmes broadcast by the SABC as primary training data. These programmes offer authentic dialogues in various African languages and cover a wide range of health topics relevant to local communities. However, the use of SABC broadcast material introduces significant legal and ethical complexities. The DSFSI team has spent years negotiating with the SABC to secure permission for use of these programmes as training data, but obtaining a definitive answer has proven elusive, leaving the project in a state of legal uncertainty.

5. Legal and Ethical Challenges

Copyright and Licensing
SABC’s health actuality programmes are protected by copyright, with all rights typically reserved by the broadcaster. Using these materials for AI training without explicit permission may constitute copyright infringement, regardless of educational or social impact goals.

Contractual Restrictions
Even if SABC content is publicly accessible, the broadcaster’s terms of use or licensing agreements may explicitly prohibit reuse, redistribution, or data mining.
Absence of Research Exceptions
South African copyright law currently lacks robust exceptions for text and data mining (TDM) or research use, unlike the European Union’s TDM exceptions or the United States’ Fair Use doctrine.

Data Privacy and Community Engagement
If the chatbot is later trained on user interactions or collects personal health information, the project must also comply with the Protection of Personal Information Act (POPIA) and ensure meaningful informed consent from all participants.

6. Public Funding and the Public Interest Argument

A significant dimension in negotiations with the SABC is the broadcaster’s funding structure. The SABC operates under a government charter and receives substantial public subsidies, with direct grants and bailouts accounting for about 27% of its 2022/2023 revenue. This strengthens the argument that SABC-produced content should be accessible for public interest projects, particularly those addressing urgent challenges like health inequality and language inclusion. Many in the research and innovation community contend that publicly funded content should be available for projects benefiting the broader public, especially those focused on health literacy and digital inclusion.

7. The WIPO Broadcasting Treaty: A New Layer of Complexity

The international copyright landscape is evolving, with the World Intellectual Property Organization (WIPO) currently negotiating a Broadcasting Treaty. Recent drafts propose granting broadcasters—including public entities like the SABC—new, additional exclusive rights over their broadcast content, independent of the underlying copyright. Some drafts suggest these new rights could override or negate existing copyright exceptions and limitations, including those that might otherwise permit uses for research, education, or public interest projects. If adopted in its current form, the WIPO Broadcasting Treaty could further restrict the ability of researchers and innovators to use broadcast material for AI training, even when the content is publicly funded or serves a vital social function.

8. The Copyright Amendment Bill: Introducing Fair Use in South Africa

A potentially transformative development is the Copyright Amendment Bill, which aims to introduce a Fair Use doctrine into South African law. Modeled after the U.S. system, Fair Use would allow limited use of copyrighted material without permission for research, teaching, and public interest innovation—the core activities of the DSFSI health chatbot initiative. If enacted, the Bill would provide a much-needed legal pathway for researchers to use materials like SABC broadcasts for AI training, provided the use is fair, non-commercial, and does not undermine the market for the original work. However, the Bill has faced significant opposition and delays, and is currently under review by the Constitutional Court, leaving its future uncertain.

9. Contractual or Policy Barriers

In the absence of clear research exceptions, the project team must review and potentially negotiate with the SABC to secure permissions or licenses for the intended use of broadcast content. Without such agreements, the project may be forced to exclude valuable data sources or pivot to community-generated content.

10. Cross-Border and Multi-Jurisdictional Issues

If the chatbot expands to use or serve content from other African countries, it will encounter a patchwork of copyright and data protection laws, further complicating compliance and cross-border collaboration.
11. Conclusions

The challenges faced by the DSFSI health chatbot project

Case Studies, TDM Cases

Masakhane: Use of the JW300 Dataset for Natural Language Processing

The Masakhane Project showcases the transformative power of open, collaborative efforts in advancing natural language processing (NLP) for African languages. However, its reliance on the JW300 dataset—a vast multilingual corpus primarily comprising copyrighted biblical translations—uncovered significant legal and ethical challenges. These challenges centred on copyright restrictions, contract overrides, and the complexities of cross-border data use, and led to the discontinuation of JW300’s use within Masakhane, prompting a shift toward community-generated data. The experience illustrates the urgent need for robust copyright exceptions, clear legal frameworks, and ethical data sourcing to foster innovation and inclusivity in global NLP research. A video version of the case study is available below.

1. What Is Natural Language Processing?

Natural Language Processing (NLP) is a branch of computer science and artificial intelligence focused on enabling computers to understand, interpret, and generate human language, both written and spoken. NLP integrates computational linguistics with machine learning, deep learning, and statistical modeling, allowing machines to recognize patterns, extract meaning, and respond to natural language inputs in ways that approximate human comprehension. NLP underpins many everyday technologies, including search engines, digital assistants, chatbots, voice-operated GPS systems, and automated translation services. NLP is crucial for breaking down language barriers and has become integral to the digital transformation of societies worldwide.

2. Masakhane Project Overview

The Masakhane Project is an open-source initiative dedicated to advancing NLP for African languages. Its mission is to democratize access to NLP tools by building a continent-wide research community and developing datasets and benchmarks tailored to Africa’s linguistic diversity. By engaging researchers, linguists, and technologists across the continent, Masakhane ensures that African languages are not marginalized in the digital age. The project employs advanced sequence-to-sequence models, training them on parallel corpora to enable machine translation and other NLP tasks between African languages. The distributed network of contributors allows Masakhane to address the unique challenges of Africa’s linguistic landscape, where many languages lack sufficient digital resources.

A notable achievement is the “Decolonise Science” project, which creates multilingual parallel corpora of African research by translating scientific papers from platforms like AfricArxiv into various African languages. This initiative enhances access to academic knowledge and promotes the use of African languages in scientific discourse, exemplifying Masakhane’s commitment to African-centric knowledge production and community benefit.

3. JW300 Dataset and Its Role

The JW300 dataset was pivotal to Masakhane’s early work. It offers around 100,000 parallel sentences for each of over 300 African languages, mostly sourced from Jehovah’s Witnesses’ biblical translations. For many languages, JW300 is one of the only large-scale, aligned text sources available, making it invaluable for training baseline translation models such as English-to-Zulu or English-to-Yoruba. Masakhane utilized automated scripts for downloading and preprocessing JW300, including byte-pair encoding (BPE) to optimize model performance. Community contributions further expanded the dataset’s coverage, filling language gaps and improving resource quality.
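As an illustration of the kind of preprocessing described here (aligning parallel text and learning a byte-pair-encoding vocabulary before training a sequence-to-sequence translation model), a minimal sketch using the SentencePiece library might look like the following. The corpus file names, language pair and vocabulary size are assumptions; this is not Masakhane’s actual tooling.

```python
import sentencepiece as spm

# Assumed input: one plain-text file per side of an aligned English-Yoruba
# parallel corpus, one sentence per line (line i of each file forms a pair).
spm.SentencePieceTrainer.train(
    input="train.en,train.yo",   # hypothetical corpus files
    model_prefix="bpe_en_yo",
    vocab_size=8000,             # typical small-corpus BPE vocabulary size
    model_type="bpe",
)

# Apply the learned BPE model to segment sentences into subword units,
# which is the representation fed to the sequence-to-sequence model.
sp = spm.SentencePieceProcessor(model_file="bpe_en_yo.model")
print(sp.encode("Where is the nearest clinic?", out_type=str))
```

In practice the same BPE model would be applied to both training and evaluation data, so that the translation model always sees a consistent subword vocabulary.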
JW300’s widespread use enabled rapid progress in building machine translation models for underrepresented African languages.

4. Copyright Infringement Discovery

Despite JW300’s open availability on platforms like OPUS, its use was legally problematic. In 2023, a legal audit by the Centre for Intellectual Property and Information Technology Law (CIPIT) in Nairobi revealed that the Jehovah’s Witnesses’ website explicitly prohibited text and data mining in its copyright notice. This meant Masakhane’s use of JW300 was unauthorized. When Masakhane’s organizers formally requested permission to use the data, their request was denied. This highlighted a fundamental tension between Masakhane’s open research ethos and the proprietary restrictions imposed by the dataset’s owners, forcing the project to reconsider its data strategy.

5. Copyright Exceptions and Limitations: The Role of TDM Exceptions and Fair Use

Many jurisdictions provide copyright exceptions and limitations to balance creators’ rights with the needs of researchers and innovators. The European Union’s text and data mining (TDM) exceptions and the United States’ Fair Use doctrine are prominent examples. The EU’s Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790) introduced two mandatory TDM exceptions. The first allows research organizations and cultural heritage institutions to conduct TDM for scientific research, regardless of contractual provisions. The second permits anyone to perform TDM for any purpose, provided the rights holder has not expressly opted out. Recent German case law clarified that an explicit reservation in a website’s terms is sufficient to exclude commercial TDM, but the exception remains robust for research contexts.

In the U.S., the Fair Use doctrine allows limited use of copyrighted material without permission for purposes like criticism, comment, teaching, scholarship, or research. Courts increasingly recognize that using copyrighted works to train AI models can qualify as Fair Use, especially when the use is transformative and does not harm the original work’s market. Had Masakhane operated in the EU or U.S., these exceptions might have provided a legal basis for using JW300 for non-commercial research. However, most African countries lack clear TDM provisions or Fair Use recognition, exposing researchers to greater legal uncertainty and risk. The Masakhane experience underscores the need for African nations to adopt or clarify copyright exceptions that support research and digital innovation.

6. Contract Overrides

Contract overrides occur when contractual terms—such as website terms of service—impose restrictions beyond those set by statutory copyright law. In JW300’s case, Jehovah’s Witnesses’ website terms explicitly prohibit text and data mining, overriding any potential exceptions or fair use provisions. For Masakhane, this meant that even if their use could be justified under fair use or research exceptions in some jurisdictions, the contractual terms imposed stricter limitations. Only in jurisdictions where statutes prevent contracts from overriding copyright exceptions (such as the EU’s TDM provision for research institutions) could these terms be challenged. This highlights the importance of reviewing all terms of service and data use agreements before using third-party datasets, especially in open, cross-border research projects.
7. Cross-Border Use

The cross-border nature of datasets like JW300 adds further legal complexity, especially for open research projects with contributors across multiple countries. Masakhane operates in a pan-African context, with team members and users in different nations. Copyright and data use laws vary widely. What is permissible under fair

Artificial Intelligence, Blog, Case Studies, TDM Cases

Promoting AI for Good in the Global South – Highlights

Across Africa and Latin America, researchers are using Artificial Intelligence to solve pressing problems: from addressing health challenges and increasing access to information for underserved communities, to preserving languages and culture. This wave of “AI for Good” in the Global South faces a major difficulty: how to access good quality training data, which is scarce in the region and often subject to copyright restrictions.

The most prominent AI companies are in the Global North and increasingly in China. These companies generally operate in jurisdictions with more permissive copyright exceptions, which enable Text and Data Mining (TDM), often the first step in training AI language models. The scale of data extraction and exploitation by a handful of AI mega-corporations has raised two pressing concerns: what about researchers and developers in the Global South, and what about the creators and communities whose data is being used to train the AI models?

Ethical AI: An Opportunity for the Global South?

At a side event in April at WIPO, we showcased some models of “ethical AI” aimed at:

The event took place in Geneva in April 2025. This week we released a 15-minute highlights video.

Training data and copyright issues

At the start of the event, we cited two Text and Data Mining projects in Africa which have had difficulty in accessing training data due to copyright. The first was the Masakhane Project in Kenya, which used translations of the Bible to develop Natural Language Processing tools in African languages. The second was the Data Sciences for Social Impact group at the University of Pretoria in South Africa, which wants to develop a health chatbot using broadcast TV shows as the training data.

Data Farming, The NOODL license, Copyright Reform

The following speakers then presented cutting-edge work on how to solve copyright and other legal and ethical challenges facing public interest AI in Africa:

The AI Act in Brazil: Remunerating Creators

Carolina Miranda of the Ministry of Culture in Brazil indicated that her government is focused on passing a new law to ensure that those creators in Brazil whose work is used to train AI models are properly remunerated. Ms Miranda described how Big Tech in the Global North fails to properly pay creators in Brazil and elsewhere for the exploitation of their work. She confirmed that discussions of the AI Act are still ongoing and that non-profit scientific research will be exempt from the remuneration provision.

Jamie Love of Knowledge Ecology International suggested that, to avoid the tendency of data providers to build a moat around their datasets, a useful model is the Common European Data Spaces being established by the European Commission.

Four factors to Evaluate AI for Good

At the end of the event we put forward the following four discriminating factors which might be used to evaluate to what extent copyright exceptions and limitations should allow developers and researchers to use training data in their applications:

The panel was convened by the Via Libre Foundation in Argentina and ReCreate South Africa, with support from the Program on Information Justice and Intellectual Property (PIJIP) at American University and from the Arcadia Fund. We are currently researching case studies on Text and Data Mining (TDM) and AI for Good in Africa and the Global South.

Ben Cashdan is an economist and TV producer in Johannesburg and the Executive Director of Black Stripe Foundation. He also co-founded ReCreate South Africa.

Blog

The NOODL license: Licensing African datasets to support research and AI in the Global South

With the increasing prominence of AI in all sectors of our economy and society, access to training data has become an important topic for practitioners and policy makers. In the Global North, a small number of large corporations with deep pockets have gained a head start in AI development, using training data from all over the world. But what about the creators and the communities whose creative works and languages are being used to train AI models? Shouldn’t they also derive some benefit? And what about AI developers in Africa and the Global South, who often struggle to gain access to training data?

In an effort to level the playing field and ensure that AI supports the public interest, legal experts and practitioners in the Global South are developing new tools and protocols which aim to tackle these questions. One approach is to come up with new licenses for datasets. In a pathbreaking initiative, lawyers at Strathmore University in Nairobi have teamed up with their counterparts at the University of Pretoria to develop the NOODL license. NOODL is a tiered license, building on Creative Commons, but with preferential terms for developers in Africa and the Global South. It also opens the door for recognition and a flow of benefits to creators and communities. NOODL was inspired by researchers using African language works to develop Natural Language Processing systems, for purposes such as translation and language preservation.

In this presentation, Dr Melissa Omino, the Head of the Centre for Intellectual Property and Information Technology Law (CIPIT) at Strathmore University in Nairobi, Kenya, talks about the NOODL license. This presentation was originally delivered at the Conference on Copyright and the Public Interest in Africa and the Global South, in Johannesburg in February 2025. The full video of the presentation is available here.

Licensing African Datasets to ensure support for Research and AI in the Global South
Dr Melissa Omino

Introduction

[Ben Cashdan]: We have Dr. Melissa Omino from CIPIT at Strathmore University in Nairobi to talk a little bit about a piece of work that they’re doing to try and ensure that the doors are not closed, that there is some opportunity to go on doing AI, doing research in Africa, but not necessarily throwing the doors open to everybody to do everything with all our stuff. Tell us a little bit about that.

[Dr Melissa Omino]: Well, I really like that introduction. Yes, and that was the thinking behind it. Also, it’s interesting that I’m sitting next to Vukosi [Marivate, Professor of Computer Science at University of Pretoria] because Vukosi has a great influence on why the license exists. You’ve heard him talking about Masakhane and the language data that they needed. In the previous ReCreate conference, where we talked about the JW300 dataset, I hope you all know about that. If you don’t know, this is a plug for the ReCreate YouTube channel so that you can go and look at that story. That’s a Masakhane story.

Background: The JW300 Dataset

To make sure that we’re all together in the room, I’ll give you a short synopsis about the JW300 dataset. Vukosi, you can jump in if I get something wrong. Essentially, Masakhane, as a group of African AI developers, were conducting text data mining online for African languages so that they could build AI tools that solve African problems. We just had a wonderful example right now about the weather in Zulu, things like that.
That’s what they wanted to cater for and the solutions they wanted to create. They went ahead and found [that there are] very minimal datasets or data available online for African problem solving, basically in African languages. But they did find one useful resource, which was on the Jehovah’s Witness website, where it had a lot of African languages because they had translated the Bible into different African languages. They were utilizing this in what was called the JW300 dataset. However, somehow, I don’t know how, you guys thought about copyright. They thought about copyright after text data mining. They thought, hey, can you actually use this dataset? That’s how they approached it. The first thing we did was look at the website.

Copyright notices excluding text and data mining

Most websites have a copyright notice, and a copyright notice lets you know what you can and can’t do with the copyright material that is presented on the website. The copyright notice on the Jehovah’s Witness website specifically excluded text data mining for the data that was there. We went back to Masakhane and said, sorry, you can’t use all this great work that you’ve collected. You can’t use it because it belongs to Jehovah’s Witness, and Jehovah’s Witness is an American company registered in Pennsylvania. They asked us, how is it that this is African languages from different parts of Africa, and the copyright belongs to an American company, and we cannot use the language? I said, well, that’s how the law works. And so they abandoned the JW300 dataset.

This created a new avenue of research because Masakhane did not give up. They became innovative and decided to collect their own language datasets. And not only is Masakhane doing this, Kencorpus is also doing this by collecting their own language datasets.

Building a Corpus of African Language Data

But where do you get African languages from? People. You go to the people to collect the language, right? If you’re lucky, you can find a text that has the language, but not all African languages will have the text. Your first source would be the communities that speak the African languages, right? So you’re funded because collecting language is expensive – Vukosi can confirm. He’s collecting 3,000 languages or 3,000 hours of languages. His budget is crazy to collect that. So you collect the language. You go to the community, record them however you want to do that. Copyright experts will tell you the minute you
