Ethical Sourcing of African Language Data: Lanafrica and the NOODL licence
Over 2,000 African languages are spoken by approximately 1.4 billion people on the continent, and this linguistic diversity underpins African democracy, development, and cultural life. As artificial intelligence becomes central to progress in areas like healthcare, agriculture and education, new methods for collecting and sharing African language data are urgently needed: methods that support innovation while protecting community interests. In this video we look at two pioneering initiatives in the field of ethical data sourcing in Africa.

The Mining Mindset: From Extraction to Partnership

Historically, many dataset projects have adopted an extractive approach: gathering language data from communities with little return or recognition for the contributors. Professor Gloria Emezue, Research Lead at Lanafrica, diagnoses this approach: “We saw the whole idea of exploitation of data as something of a mindset. When you’re out to mine something, you’re actually exploiting that thing. We now say, okay, let’s have a new approach, which we call data farming.”

Rather than “mining” communities for linguistic resources, Lanafrica works to “farm” data collaboratively, turning village squares into living laboratories, involving contributors directly, and ensuring that partnerships support further community-driven research.

A practical example is NaijaVoices: over 1,800 hours of curated audio data in Nigerian languages, sourced through relationships built on respect and transparency. For Lanafrica, research access is free, but commercial use involves agreements that channel support back to the language communities, creating a virtuous cycle of benefit and ongoing dataset growth.

Who Owns the Data, and Who Reaps the Rewards?

Behind every local dataset is a web of legal and ethical questions about ownership and benefit sharing.
Dr Melissa Omino, Director at the Centre for Intellectual Property and Information Technology Law (CIPIT), Strathmore University, notes the paradox that emerges when communities contribute their voices but lose control over how their words are used: “Where do you get African languages from? Your first source would be the communities that speak the African languages, right? So you go to the community, record them. The minute you make the recording, then you have created a copyrighted work, and whoever has made the recording owns the copyright.”

According to Professor Chijoke Okorie of the Data Science Law Lab at the University of Pretoria, what ends up happening is that the communities that created these datasets end up paying for products that are built on the datasets that they created. Addressing these inequities requires new licensing models that prioritise context, impact, and fair division of benefits.

NOODL Licence: Reimagining Access and Justice

The Nwolite Obodo Open Data Licence (NOODL) breaks new ground in recognising African realities. It creates a two-tier licensing system: broad, cost-free access for African users, and negotiations or royalties for wealthy, commercial or international users. Communities remain central, not secondary.

Professor Chijoke Okorie describes the ethos of NOODL: “Nwolite Obodo is Igbo… for community raising, community development. We’ve grown from a research group at the University of Pretoria to a research network of researchers from both Anglophone and Francophone Africa.”

NOODL’s design puts community benefit at the forefront. As Dr Melissa Omino explains: “If you are from a developing country, let’s say you’re from Brazil, and you want to use the data that is licensed under this regime, then you can use it under Creative Commons licence. If you are a multi-million dollar tech company that wants to use the licence, then you need to negotiate with the AI developers who collected the data.
And the licence also ensures that the community gets a benefit.”

Towards More Equitable Language Data

If African language data is simply extracted and exported, local voices risk exclusion, both from representation and from the benefits of digital innovation. By refocusing efforts on partnership and fair licensing, projects like Lanafrica and NOODL demonstrate that ethical, sustainable language technology is achievable. Their experiences may help guide how other communities and researchers engage with digital resources and cultural heritage, ensuring that language data works for those who speak it.