Author name: Ben Cashdan

Case Studies, TDM Cases

A Talking Health Chatbot in African Languages: DSFSI, University of Pretoria

A project at the Data Sciences for Social Impact (DSFSI) group, University of Pretoria, led by Professor Vukosi Marivate, is developing a talking health chatbot in African languages to provide accessible, culturally relevant health information to underserved communities. Central to the initiative is the planned use of health actuality TV and radio programmes produced by the South African Broadcasting Corporation (SABC) as training data, which introduces legal and ethical considerations around the use of publicly funded broadcast materials in AI. In the absence of South Africa’s pending copyright reforms, the project may be held up, pointing to the need for harmonised legal frameworks for AI for Good in Africa.

1. Health Inequality in South Africa and the Role of AI

South Africa experiences some of the world’s highest health inequalities, shaped by historical, economic, and social factors. Urban centers are relatively well-resourced, but rural and peri-urban areas face critical shortages of health professionals and infrastructure. Language and literacy barriers further exacerbate disparities, with many unable to access health information in their mother tongue. Digital health interventions, especially those using artificial intelligence, offer a way to bridge these gaps by delivering accurate, on-demand health information in local languages. An AI-powered chatbot can empower users to make informed decisions, understand symptoms, and navigate the healthcare system, promoting greater equity in health outcomes.

2. What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a subfield of artificial intelligence that enables computers to understand, interpret, and generate human language. NLP powers applications such as chatbots, voice assistants, and automated translation tools, making it crucial for digital inclusion, especially for speakers of underrepresented languages.

3. Project Overview

The DSFSI health chatbot project aims to build an AI-powered conversational agent that delivers reliable health information in multiple African languages. The project’s mission is to address health literacy gaps and promote equitable access to vital information, particularly in communities where language and resource barriers persist.

4. Data Sources and Key Resources

A distinctive feature of the project is its intention to use health actuality programmes broadcast by the SABC as primary training data. These programmes offer authentic dialogues in various African languages and cover a wide range of health topics relevant to local communities. However, the use of SABC broadcast material introduces significant legal and ethical complexities. The DSFSI team has spent years negotiating with the SABC to secure permission for use of these programmes as training data, but obtaining a definitive answer has proven elusive, leaving the project in a state of legal uncertainty.

5. Legal and Ethical Challenges

Copyright and Licensing

SABC’s health actuality programmes are protected by copyright, with all rights typically reserved by the broadcaster. Using these materials for AI training without explicit permission may constitute copyright infringement, regardless of educational or social impact goals.

Contractual Restrictions

Even if SABC content is publicly accessible, the broadcaster’s terms of use or licensing agreements may explicitly prohibit reuse, redistribution, or data mining.

Absence of Research Exceptions

South African copyright law currently lacks robust exceptions for text and data mining (TDM) or research use, unlike the European Union’s TDM exceptions or the United States’ Fair Use doctrine.

Data Privacy and Community Engagement

If the chatbot is later trained on user interactions or collects personal health information, the project must also comply with the Protection of Personal Information Act (POPIA) and ensure meaningful informed consent from all participants.

6. Public Funding and the Public Interest Argument

A significant dimension in negotiations with the SABC is the broadcaster’s funding structure. The SABC operates under a government charter and receives substantial public subsidies, with direct grants and bailouts accounting for about 27% of its 2022/2023 revenue. This strengthens the argument that SABC-produced content should be accessible for public interest projects, particularly those addressing urgent challenges like health inequality and language inclusion. Many in the research and innovation community contend that publicly funded content should be available for projects benefiting the broader public, especially those focused on health literacy and digital inclusion.

7. The WIPO Broadcasting Treaty: A New Layer of Complexity

The international copyright landscape is evolving, with the World Intellectual Property Organization (WIPO) currently negotiating a Broadcasting Treaty. Recent drafts propose granting broadcasters, including public entities like the SABC, new, additional exclusive rights over their broadcast content, independent of the underlying copyright. Some drafts suggest these new rights could override or negate existing copyright exceptions and limitations, including those that might otherwise permit uses for research, education, or public interest projects. If adopted in its current form, the WIPO Broadcasting Treaty could further restrict the ability of researchers and innovators to use broadcast material for AI training, even when the content is publicly funded or serves a vital social function.

8. The Copyright Amendment Bill: Introducing Fair Use in South Africa

A potentially transformative development is the Copyright Amendment Bill, which aims to introduce a Fair Use doctrine into South African law. Modeled after the U.S. system, Fair Use would allow limited use of copyrighted material without permission for research, teaching, and public interest innovation, the core activities of the DSFSI health chatbot initiative. If enacted, the Bill would provide a much-needed legal pathway for researchers to use materials like SABC broadcasts for AI training, provided the use is fair, non-commercial, and does not undermine the market for the original work. However, the Bill has faced significant opposition and delays, and is currently under review by the Constitutional Court, leaving its future uncertain.

9. Contractual or Policy Barriers

In the absence of clear research exceptions, the project team must review and potentially negotiate with the SABC to secure permissions or licenses for the intended use of broadcast content. Without such agreements, the project may be forced to exclude valuable data sources or pivot to community-generated content.

10. Cross-Border and Multi-Jurisdictional Issues

If the chatbot expands to use or serve content from other African countries, it will encounter a patchwork of copyright and data protection laws, further complicating compliance and cross-border collaboration.

11. Conclusions

The challenges faced by the DSFSI health chatbot project

Case Studies, TDM Cases

Masakhane: Use of the JW300 Dataset for Natural Language Processing

The Masakhane Project showcases the transformative power of open, collaborative efforts in advancing natural language processing (NLP) for African languages. However, its reliance on the JW300 dataset, a vast multilingual corpus primarily comprising copyrighted biblical translations, uncovered significant legal and ethical challenges. These challenges centered on copyright restrictions, contract overrides, and the complexities of cross-border data use. This led to the discontinuation of JW300’s use within Masakhane, prompting a shift toward community-generated data. The experience illustrates the urgent need for robust copyright exceptions, clear legal frameworks, and ethical data sourcing to foster innovation and inclusivity in global NLP research. A video version of the case study is available below.

1. What Is Natural Language Processing?

Natural Language Processing (NLP) is a branch of computer science and artificial intelligence focused on enabling computers to understand, interpret, and generate human language, both written and spoken. NLP integrates computational linguistics with machine learning, deep learning, and statistical modeling, allowing machines to recognize patterns, extract meaning, and respond to natural language inputs in ways that approximate human comprehension. NLP underpins many everyday technologies, including search engines, digital assistants, chatbots, voice-operated GPS systems, and automated translation services. NLP is crucial for breaking down language barriers and has become integral to the digital transformation of societies worldwide [1].

2. Masakhane Project Overview

The Masakhane Project is an open-source initiative dedicated to advancing NLP for African languages. Its mission is to democratize access to NLP tools by building a continent-wide research community and developing datasets and benchmarks tailored to Africa’s linguistic diversity. By engaging researchers, linguists, and technologists across the continent, Masakhane ensures that African languages are not marginalized in the digital age. The project employs advanced sequence-to-sequence models, training them on parallel corpora to enable machine translation and other NLP tasks between African languages. The distributed network of contributors allows Masakhane to address the unique challenges of Africa’s linguistic landscape, where many languages lack sufficient digital resources.

A notable achievement is the “Decolonise Science” project, which creates multilingual parallel corpora of African research by translating scientific papers from platforms like AfricArxiv into various African languages. This initiative enhances access to academic knowledge and promotes the use of African languages in scientific discourse, exemplifying Masakhane’s commitment to African-centric knowledge production and community benefit.

3. JW300 Dataset and Its Role

The JW300 dataset was pivotal to Masakhane’s early work. It offers around 100,000 parallel sentences per language across more than 300 languages, many of them African, mostly sourced from Jehovah’s Witnesses’ biblical translations. For many African languages, JW300 is one of the only large-scale, aligned text sources available, making it invaluable for training baseline translation models such as English-to-Zulu or English-to-Yoruba. Masakhane used automated scripts for downloading and preprocessing JW300, including byte-pair encoding (BPE) to optimize model performance. Community contributions further expanded the dataset’s coverage, filling language gaps and improving resource quality. JW300’s widespread use enabled rapid progress in building machine translation models for underrepresented African languages.

4. Copyright Infringement Discovery

Despite JW300’s open availability on platforms like OPUS, its use was legally problematic. In 2023, a legal audit by the Centre for Intellectual Property and Information Technology (CIPIT) in Nairobi revealed that the Jehovah’s Witnesses’ website explicitly prohibited text and data mining in its copyright notice. This meant Masakhane’s use of JW300 was unauthorized. When Masakhane’s organizers formally requested permission to use the data, their request was denied. This highlighted a fundamental tension between Masakhane’s open research ethos and the proprietary restrictions imposed by the dataset’s owners, forcing the project to reconsider its data strategy.

5. Copyright Exceptions and Limitations: The Role of TDM Exceptions and Fair Use

Many jurisdictions provide copyright exceptions and limitations to balance creators’ rights with the needs of researchers and innovators. The European Union’s text and data mining (TDM) exceptions and the United States’ Fair Use doctrine are prominent examples.

The EU’s Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790) introduced two mandatory TDM exceptions. The first allows research organizations and cultural heritage institutions to conduct TDM for scientific research, regardless of contractual provisions. The second permits anyone to perform TDM for any purpose, provided the rights holder has not expressly opted out. Recent German case law clarified that an explicit reservation in a website’s terms is sufficient to exclude commercial TDM, but the exception remains robust for research contexts.

In the U.S., the Fair Use doctrine allows limited use of copyrighted material without permission for purposes like criticism, comment, teaching, scholarship, or research. Courts increasingly recognize that using copyrighted works to train AI models can qualify as Fair Use, especially when the use is transformative and does not harm the original work’s market.

Had Masakhane operated in the EU or U.S., these exceptions might have provided a legal basis for using JW300 for non-commercial research. However, most African countries lack clear TDM provisions or Fair Use recognition, exposing researchers to greater legal uncertainty and risk. The Masakhane experience underscores the need for African nations to adopt or clarify copyright exceptions that support research and digital innovation.

6. Contract Overrides

Contract overrides occur when contractual terms, such as website terms of service, impose restrictions beyond those set by statutory copyright law. In JW300’s case, Jehovah’s Witnesses’ website terms explicitly prohibit text and data mining, overriding any potential exceptions or fair use provisions. For Masakhane, this meant that even if their use could be justified under fair use or research exceptions in some jurisdictions, the contractual terms imposed stricter limitations. Only in jurisdictions where statutes prevent contracts from overriding copyright exceptions (such as the EU’s TDM provision for research institutions) could these terms be challenged. This highlights the importance of reviewing all terms of service and data use agreements before using third-party datasets, especially in open, cross-border research projects.

7. Cross-Border Use

The cross-border nature of datasets like JW300 adds further legal complexity, especially for open research projects with contributors across multiple countries. Masakhane operates in a pan-African context, with team members and users in different nations. Copyright and data use laws vary widely. What is permissible under fair

Africa: Copyright & Public Interest, Artificial Intelligence, Blog, TDM Cases

Promoting AI for Good in the Global South – Highlights

by Ben Cashdan

Across Africa and Latin America, researchers are using Artificial Intelligence to solve pressing problems: from addressing health challenges and increasing access to information for underserved communities, to preserving languages and culture. This wave of “AI for Good” in the Global South faces a major difficulty: how to access good-quality training data, which is scarce in the region and often subject to copyright restrictions.

The most prominent AI companies are in the Global North and increasingly in China. These companies generally operate in jurisdictions with more permissive copyright exceptions, which enable Text and Data Mining (TDM), often the first step in training AI language models. The scale of data extraction and exploitation by a handful of AI mega-corporations has raised two pressing concerns: What about researchers and developers in the Global South, and what about the creators and communities whose data is being used to train the AI models?

Ethical AI: An Opportunity for the Global South?

At a side event in April at WIPO, we showcased some models of ‘ethical AI’ aimed at:

The event took place in Geneva in April 2025. This week we released a 15-minute highlights video.

Training data and copyright issues

At the start of the event, we cited two Text and Data Mining projects in Africa which have had difficulty in accessing training data due to copyright. The first was the Masakhane Project in Kenya, which used translations of the Bible to develop Natural Language Processing tools in African languages. The second was the Data Sciences for Social Impact group at the University of Pretoria in South Africa, which wants to develop a health chatbot using broadcast TV shows as the training data.
Data Farming, The NOODL license, Copyright Reform

The following speakers then presented cutting-edge work on how to solve copyright and other legal and ethical challenges facing public interest AI in Africa:

The AI Act in Brazil: Remunerating Creators

Carolina Miranda of the Ministry of Culture in Brazil indicated that her government is focused on passing a new law to ensure that creators in Brazil whose work is used to train AI models are properly remunerated. Ms Miranda described how Big Tech in the Global North fails to properly pay creators in Brazil and elsewhere for the exploitation of their work. She confirmed that discussions of the AI Act are still ongoing and that non-profit scientific research will be exempt from the remuneration provision.

Jamie Love of Knowledge Ecology International suggested that, to avoid the tendency of data providers to build a moat around their datasets, a useful model is the Common European Data Spaces being established by the European Commission.

Four factors to Evaluate AI for Good

At the end of the event we put forward the following four discriminating factors which might be used to evaluate to what extent copyright exceptions and limitations should allow developers and researchers to use training data in their applications:

The panel was convened by the Via Libre Foundation in Argentina and ReCreate South Africa with support from the Program on Information Justice and Intellectual Property (PIJIP) at American University, and support from the Arcadia Fund. We are currently researching case studies on Text and Data Mining (TDM) and AI for Good in Africa and the Global South.

Ben Cashdan is an economist and TV producer in Johannesburg and the Executive Director of Black Stripe Foundation. He also co-founded ReCreate South Africa.

Blog, Case Studies, TDM Cases

The NOODL license: Licensing African datasets to support research and AI in the Global South

With the increasing prominence of AI in all sectors of our economy and society, access to training data has become an important topic for practitioners and policy makers. In the Global North, a small number of large corporations with deep pockets have gained a head start in AI development, using training data from all over the world. But what about the creators and the communities whose creative works and languages are being used to train AI models? Shouldn’t they also derive some benefit? And what about AI developers in Africa and the Global South, who often struggle to gain access to training data?

To level the playing field and ensure that AI supports the public interest, legal experts and practitioners in the Global South are developing new tools and protocols which aim to tackle these questions. One approach is to come up with new licenses for datasets. In a pathbreaking initiative, lawyers at Strathmore University in Nairobi have teamed up with their counterparts at the University of Pretoria to develop the NOODL license. NOODL is a tiered license, building on Creative Commons, but with preferential terms for developers in Africa and the Global South. It also opens the door for recognition and a flow of benefits to creators and communities. NOODL was inspired by researchers using African language works to develop Natural Language Processing systems, for purposes such as translation and language preservation.

In this presentation, Dr Melissa Omino, the Head of the Centre for Intellectual Property and Information Technology Law (CIPIT) at Strathmore University in Nairobi, Kenya, talks about the NOODL license. This presentation was originally delivered at the Conference on Copyright and the Public Interest in Africa and the Global South, in Johannesburg in February 2025. The full video of the presentation is available here.
Licensing African Datasets to ensure support for Research and AI in the Global South

Dr Melissa Omino

Introduction

[Ben Cashdan]: We have Dr. Melissa Omino from CIPIT at the University of Strathmore in Nairobi to talk a little bit about a piece of work that they’re doing to try and ensure that the doors are not closed, that there is some opportunity to go on doing AI, doing research in Africa, but not necessarily throwing the doors open to everybody to do everything with all our stuff. Tell us a little bit about that.

[Dr Melissa Omino]: Well, I really like that introduction. Yes, and that was the thinking behind it. Also, it’s interesting that I’m sitting next to Vukosi [Marivate, Professor of Computer Science at University of Pretoria] because Vukosi has a great influence on why the license exists. You’ve heard him talking about Masakhane and the language data that they needed. In the previous ReCreate conference, where we talked about the JW300 dataset, I hope you all know about that. If you don’t know, this is a plug for the ReCreate YouTube channel so that you can go and look at that story. That’s a Masakhane story.

Background: The JW300 Dataset

To make sure that we’re all together in the room, I’ll give you a short synopsis about the JW300 dataset. Vukosi, you can jump in if I get something wrong. Essentially, Masakhane, as a group of African AI developers, were conducting text data mining online for African languages so that they could build AI tools that solve African problems. We just had a wonderful example right now about the weather in Zulu, things like that. That’s what they wanted to cater for and the solutions they wanted to create. They went ahead and found [that there are] very minimal datasets or data available online for African problem solving, basically in African languages. But they did find one useful resource, which was on the Jehovah’s Witness website, where it had a lot of African languages because they had translated the Bible into different African languages. They were utilizing this in what was called the JW300 dataset. However, somehow, I don’t know how, you guys thought about copyright. They thought about copyright after text data mining. They thought, hey, can you actually use this dataset? That’s how they approached it. The first thing we did was look at the website.

Copyright notices excluding text and data mining

Most websites have a copyright notice, and a copyright notice lets you know what you can and can’t do with the copyright material that is presented on the website. The copyright notice on the Jehovah’s Witness website specifically excluded text data mining for the data that was there. We went back to Masakhane and said, sorry, you can’t use all this great work that you’ve collected. You can’t use it because it belongs to Jehovah’s Witness, and Jehovah’s Witness is an American company registered in Pennsylvania. They asked us, how is it that this is African languages from different parts of Africa, and the copyright belongs to an American company, and we cannot use the language? I said, well, that’s how the law works. And so they abandoned the JW300 dataset. This created a new avenue of research because Masakhane did not give up. They became innovative and decided to collect their own language datasets. And not only is Masakhane doing this, Kencorpus is also doing this by collecting their own language datasets.

Building a Corpus of African Language Data

But where do you get African languages from? People. You go to the people to collect the language, right? If you’re lucky, you can find a text that has the language, but not all African languages will have the text. Your first source would be the communities that speak the African languages, right? So you’re funded because collecting language is expensive – Vukosi can confirm. He’s collecting 3,000 languages or 3,000 hours of languages. His budget is crazy to collect that. So you collect the language. You go to the community, record them however you want to do that. Copyright experts will tell you the minute you

Africa: Copyright & Public Interest, Blog

INTERNATIONAL CONFERENCE IN SOUTH AFRICA HIGHLIGHTS THE URGENCY OF COPYRIGHT REFORMS

By ReCreate South Africa

The cost of excluding billions of people in Africa and the Global South from access to knowledge could be huge for future generations. Knowledge-sharing in Africa is not always transactional, and the existing IP and copyright paradigms are not working well for creators or audiences on the continent. Creators are often poorly remunerated and in many cases audiences and students cannot afford access to knowledge and entertainment. Some global corporations take an extractive and exploitative approach to African creativity. Africa needs a new knowledge governance system to take into account the role of traditional and indigenous knowledge.

These were the conclusions of an international conference entitled “Copyright and the Public Interest: Africa and the Global South” held last month in South Africa. The convenor was ReCreate South Africa, a coalition of creators and users of copyright material, and the conference took place at the University of the Witwatersrand, Johannesburg (3 February), at the University of Cape Town Library (5 February) and at Innovation City (6 February). This conference was a follow-on from ReCreate’s inaugural conference on the “Right to Research in Africa” held at the University of Pretoria and the University of Cape Town in January 2023. The conference partnered with the Program on Information Justice and Intellectual Property (PIJIP), the intergovernmental organisation South Centre, the University of Cape Town’s IP Unit, Mandela Institute, Law School and more. The conference was made possible by PIJIP and Arcadia, as well as Open Air. You can watch the full conference sessions online.

IP as a tax on African Creativity: Protecting the Livelihoods of Creators

In his opening input, Ben Cashdan, convener of ReCreate South Africa and former economic advisor to President Nelson Mandela, said that IP royalties are a de facto tax on Africa.
“Income from IP royalties on all creativity, on all inventions around the world, topped $1 trillion in the past 24 months for the first time, and the United States gets about $130 billion of that. Africa gets a tiny fraction. Could that be because we don’t have creatives? Could that be because we don’t have actors, writers, musicians? Obviously not. The system operates in such a way that we don’t get the fruits of our labor here in this country and on this continent.”

South African singer Mercy Pakela, whose music topped the charts in the 1980s, recounted how she had signed with record labels so that her music could be heard by music lovers around the world, but over 40 years later she still feels she has not received fair remuneration. Pakela said, “I wish I knew then what I know now because then I did not know that it was business. I just wanted to be on stage. I thought it was just about talent.”

Jack Devnarain, Chairperson of the South African Guild of Actors, highlighted that many performers in Africa die poor due to the power imbalance between artists and their distributors or rights owners. He pointed a finger at those whose business models restrict the livelihoods of African performers and who are opposed to copyright reform. “There are people, particularly the American-based organizations, the corporate giants in the Global North that are working very hard, and I’m talking about the publishers, the studios, the streamers, the broadcasters, that do not want South African actors to have a royalty earning right.”

South Africa’s CAB and Why Teachers Need Fair Use

The Copyright Amendment Bill (CAB), passed by Parliament in South Africa but still awaiting the President’s signature, aims to solve the problem of exploitation of artists by introducing a right to fair royalties or equitable remuneration. The CAB also broadens access to knowledge for communities. Hence it addresses the needs of both constituencies, creators and users.
The President has referred the Bill to the Constitutional Court over concerns that it may lead to arbitrary deprivation of property of rights holders. Advocate Iain Currie, lawyer for ReCreate, raised questions around whether Intellectual Property is property in the traditional sense and also challenged the view that adjustments to copyright laws in the public interest are arbitrary.

One of the main objectives of the CAB is to ensure that teachers and learners have access to educational materials, which is clearly a public interest goal. According to Dr Mugwena Maluleke, President of Education International, “there is a shocking shortage of 44 million teachers worldwide. A major catalyst for this shortage is the inability to attract and retain teachers due to inadequate conditions for providing quality teaching,” including a shortage of textbooks and learning materials. “Fair use in education is the key that unlocks the door to a world of knowledge and creativity, by allowing educators to utilize copyrighted materials in their teaching.” Moreover, “Fair copyright legislation is essential to enabling teachers to adapt and use the material and reach an increasingly diverse student body.” Maluleke is also General Secretary of SADTU, the largest teachers union in South Africa, with a membership of over 250 000 teachers and workers.

Dr Sanya Samtani, Senior Researcher at the Mandela Institute in the Law Faculty at the University of the Witwatersrand, Johannesburg, echoed these sentiments. “The Copyright Amendment Bill is an example of the state trying to regulate copyright, trying to fulfill its international obligations on copyright, and also its human rights obligations, which are constitutional and international in nature.”

‘AI for Good’ in Africa

The conference considered the importance of Artificial Intelligence (AI) in solving the world’s most pressing challenges, including climate change, pandemic responses and countering misinformation.
Generative AI has understandably raised alarm bells amongst creatives. Professor Vukosi Marivate, Chair of Data Science at the University of Pretoria, described a project in which broadcast TV shows in South Africa could be used to train AI models to educate local communities about primary health care in indigenous African languages. Marivate said that a power reset needs to take place between local communities and Big Tech based in the Global North. This will allow AI to be used to protect
