Case Studies


LATAM-GPT: A Culturally Sensitive Large Language Model for Latin America

LATAM-GPT is a groundbreaking large language model developed by the National Center for Artificial Intelligence (CENIA) in Chile, in partnership with over thirty institutions and twelve Latin American countries. The initiative aims to create an open-source AI model that reflects the region's diverse cultures, languages—including Spanish, Portuguese, and Indigenous tongues—and social realities. Using ethically sourced, regionally contributed data, LATAM-GPT seeks to overcome the limitations and biases of global AI models predominantly trained on English data. The project is designed to empower local communities, preserve linguistic diversity, and support applications in education, public services, and beyond. Its development highlights the importance of ethical data practices, regional collaboration, and policy frameworks that foster inclusive, representative AI. A video version of this case study is available below.

1. Background: Digital Inequality and AI in Latin America

Latin America faces unique challenges in digital inclusion, with significant linguistic, cultural, and infrastructural diversity across the region. Global AI models often fail to capture local nuances, leading to inaccuracies and reinforcing stereotypes. Many Indigenous and minority languages are underrepresented in mainstream technology, exacerbating digital exclusion. LATAM-GPT addresses these gaps by building a model tailored to the region's needs, aiming to democratize access to advanced AI and promote technological sovereignty.

2. Technology and Approach

LATAM-GPT is based on Llama 3, a state-of-the-art large language model architecture. The model is trained on more than 8 terabytes of regionally sourced text, encompassing Spanish, Portuguese, and Indigenous languages such as Rapa Nui. Training is conducted on a distributed network of computers across Latin America, including facilities at the University of Tarapacá in Chile and cloud-based platforms. The open-source nature of the project allows for transparency, adaptability, and broad participation from local developers and researchers. (A brief code sketch of this kind of continued pretraining appears at the end of this case study.)

3. Project Overview

The project is coordinated by CENIA with support from the Chilean government, the regional development bank CAF, Amazon Web Services, and over thirty regional organizations. LATAM-GPT's primary objective is to serve as a foundation for culturally relevant AI applications—such as chatbots, virtual public service assistants, and educational tools—rather than directly competing with global consumer products like ChatGPT. A key focus is the preservation and revitalization of Indigenous languages, with the first translation tools already developed for Rapa Nui and plans to expand to other languages.

4. Data Sources and Key Resources

LATAM-GPT uses ethically sourced data contributed by governments, universities, libraries, archives, and community organizations across Latin America. This includes official documents, public records, literature, historical materials, and Indigenous language texts. All data is carefully curated to ensure privacy, consent, and cultural sensitivity. Unlike many global AI models, LATAM-GPT publishes its list of data sources, emphasizing transparency and ethical data governance.

5. Legal and Ethical Challenges

Copyright and Licensing: The project relies on open-access and properly licensed materials, with explicit permissions from data contributors. This approach avoids the legal uncertainties faced by models that scrape data indiscriminately from the internet.

Data Privacy and Consent: CENIA and its partners ensure that sensitive personal information is anonymized or excluded, and that data collection respects the rights and wishes of contributors, especially Indigenous communities.

Inclusivity and Bias: By prioritizing local languages and cultural contexts, LATAM-GPT aims to reduce biases inherent in global models. Ongoing community engagement and feedback are integral to the model's development and evaluation.

6. International and Regional Collaboration

LATAM-GPT exemplifies pan-regional cooperation, with twelve countries and over thirty institutions contributing expertise, data, and infrastructure. The project has also engaged international partners and multilateral organizations, such as the Organization of American States and the Inter-American Development Bank, to support its mission of technological empowerment and digital inclusion.

7. Emerging Technology and Policy Issues

LATAM-GPT's open-source model sets a precedent for responsible AI development, emphasizing transparency, ethical data use, and regional self-determination. The project also highlights the need for robust digital infrastructure and continued investment to ensure equitable access to AI across Latin America. As with all large language models, ongoing attention to potential biases, data privacy, and the impact on local labor and education systems is essential.

8. National and Regional Legal Frameworks

While LATAM-GPT's ethical sourcing and licensing practices minimize legal risks, the project underscores the importance of harmonized copyright and data protection laws across Latin America. Policymakers are encouraged to develop frameworks that facilitate data sharing for socially beneficial AI, protect Indigenous knowledge, and promote open science.

9. Contractual or Policy Barriers

Some challenges remain in securing permissions for certain data sources, particularly from private publishers or institutions with restrictive contracts. The project's commitment to open licensing and community engagement helps mitigate these barriers, but continued advocacy is needed to expand access to valuable regional content.

10. Conclusions

LATAM-GPT represents a major step forward in creating culturally sensitive, inclusive AI for Latin America. By centering ethical data practices, regional collaboration, and linguistic diversity, the project offers a model for other regions seeking to decolonize AI and ensure technology serves local needs. Continued investment, policy reform, and community participation will be crucial to realizing the full potential of LATAM-GPT and similar initiatives.

Video Version

Hear from the researchers themselves. Watch the video of this case study below.
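The sketch below makes the training approach in section 2 concrete. It is a minimal, hypothetical example of continued pretraining of a Llama-style model on a folder of regional text files using the Hugging Face transformers and datasets libraries; the model name, corpus path, and hyperparameters are illustrative assumptions, not the published LATAM-GPT configuration.

```python
# Minimal sketch of continued pretraining on regionally sourced text.
# All names below are placeholders: any Llama-style causal LM works,
# and "latam_corpus/*.txt" stands in for a curated regional corpus.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"  # gated model; access must be requested
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Load plain-text files contributed by regional partners (hypothetical path).
corpus = load_dataset("text", data_files={"train": "latam_corpus/*.txt"})["train"]

def tokenize(batch):
    # Truncate documents to a fixed context length for causal LM training.
    return tok(batch["text"], truncation=True, max_length=1024)

corpus = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="latam-gpt-ckpt",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16),
    train_dataset=corpus,
    # mlm=False produces next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

At LATAM-GPT's actual scale, with terabytes of text and machines spread across several countries, training additionally relies on distributed frameworks that shard the model and data across nodes, which this single-machine sketch omits.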


Blind South Africa: Apps for the Visually Impaired

An initiative led by Christo de Klerk at Blind South Africa focuses on promoting the use of accessible mobile and digital applications for blind and visually impaired people in South Africa. Highlighted at the 2025 Copyright and the Public Interest in Africa conference, the project addresses the importance of robust copyright exceptions and supportive legal frameworks to enable the development and use of apps that work in African languages, read texts aloud, describe images and films, and provide local information. The legal challenges – especially around copyright and licensing of accessible content and the use of visual works as AI training data – point to lessons for responsible, inclusive innovation and policy reform in assistive technology. A video version of this case study is available below.

1. Disability, Inequality, and Digital Exclusion in South Africa

South Africa has a high prevalence of visual impairment and blindness, with over a million people affected. Barriers to education, employment, and public participation are exacerbated by inaccessible information and digital services. While assistive technologies offer new opportunities for inclusion, their effectiveness depends on being designed for accessibility, linguistic diversity, and local context.

2. Assistive Technology and Accessibility

Assistive technology includes devices and software such as screen readers, braille displays, text-to-speech apps, navigation tools, and AI-powered applications that describe images and films. Development is shaped by international standards like the Web Content Accessibility Guidelines (WCAG) and local policy frameworks, ensuring that accessibility is built into digital tools. (An illustrative code sketch of an image-description pipeline appears at the end of this case study.)

3. Project Overview

The BlindSA initiative focuses on raising awareness and supporting the use of existing accessible apps for blind and visually impaired users, particularly those supporting African languages and offering features like scene description and audio narration. Rather than developing new apps, the project identifies, tests, and disseminates information about effective tools. It also advocates for legal and policy changes, including strong copyright exceptions, to empower developers and users in South Africa's multilingual communities. Community participation and user feedback are central to the approach.

4. Data Sources and Key Resources

Apps promoted by the initiative draw on diverse data sources: public domain texts, government information, community-generated content, and web-scraped text and images, including photographs, artworks, and films. Many apps use AI to generate scene descriptions and audio narration, opening up previously inaccessible content. The initiative emphasizes tools that support African languages, providing read-aloud and real-time information (such as weather and news) in Zulu, Sotho, and others. However, reliance on copyrighted materials for AI training and accessible formats raises pressing legal questions about copyright exceptions, licensing, and the need for inclusive legal frameworks.

5. Legal and Ethical Challenges

Copyright and Licensing: Many resources needed for accessible apps—books, newspapers, educational materials, artworks, photographs, and films—are protected by copyright. South Africa's current law offers limited exceptions for accessible formats, making it difficult to legally convert works into braille, audio, large print, or described video without explicit permission. Outdated copyright laws have long denied blind South Africans equal access to information, highlighting the need for robust legal exceptions.

Contractual Restrictions: Even when content is publicly available, licensing terms may prohibit adaptation or redistribution in accessible formats.

Absence of Research and Accessibility Exceptions: Unlike countries that have ratified and implemented the Marrakesh Treaty, South Africa's copyright regime remains restrictive. BlindSA's Constitutional Court case challenged these limitations, emphasizing that the right to read is a fundamental human right and that lack of access excludes blind people from education, employment, and society.

6. The Marrakesh Treaty and the Struggle for Equality

The Marrakesh Treaty, adopted by WIPO in 2013, requires signatory countries to create copyright exceptions for accessible formats and enables cross-border sharing. It is a milestone in addressing the "book famine" for blind people and affirms access to information as a fundamental right. South Africa signed the Treaty in 2014 but, as of 2025, has not fully implemented it into national law, leaving the blind community at a disadvantage. BlindSA has been at the forefront of advocacy for these reforms, including a Constitutional Court challenge to the country's copyright law.

7. Generative AI, Copyright Backlash, and the Accessibility Exception

The rise of generative AI has sparked backlash from creators concerned about unauthorized use of their works as training data and the risk of AI-generated content substituting for original works. However, experts and advocates for the visually impaired emphasize the importance of copyright exceptions for accessibility. When AI is used for accessible formats—such as describing images or reading text aloud—there is no substitution effect, only an expansion of access and rights. Copyright policy and AI regulation should distinguish between commercial, substitutive uses of AI and socially beneficial uses that enable access for the visually impaired. The Marrakesh Treaty and proposed South African reforms recognize the need to balance creators' interests with the transformative potential of AI for inclusion.

8. National Legal Reform

The Copyright Amendment Bill proposes new exceptions for creating and distributing accessible format copies, as well as a general fair use provision. If enacted, these reforms would significantly improve access to information for the visually impaired in South Africa.

9. Contractual or Policy Barriers

The initiative highlights the need for open licensing and permissions from publishers and content providers. Some government and educational materials remain inaccessible due to restrictive licenses or outdated contracts.

10. Conclusions

Despite legal and policy challenges, BlindSA's work has raised awareness of accessible apps and digital resources, including text-to-speech tools, navigation apps, and AI-powered scene description in South African languages. User participation is emphasized: "Accessibility is not a luxury or an afterthought—it is a right." BlindSA's experience highlights the urgent need for harmonized copyright exceptions, open standards, and inclusive digital policy. Policymakers, funders, and technology developers should prioritize accessibility and the rights of people with disabilities in all digital innovation efforts.

Video Version

Hear from the researchers themselves. Watch the video of this case study below.
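To make sections 2 and 4 concrete, here is a minimal sketch pairing an off-the-shelf image-captioning model with a text-to-speech engine to describe an image and read the description aloud. The model and libraries are assumptions chosen for illustration, not tools endorsed or distributed by BlindSA, and text-to-speech coverage of South African languages varies widely between engines and must be verified per language.

```python
# Illustrative scene-description-plus-speech pipeline (hypothetical example).
from gtts import gTTS
from transformers import pipeline

# Off-the-shelf captioning model, used here purely as an example.
captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")

def describe_aloud(image_path: str, out_mp3: str = "description.mp3") -> str:
    # Generate a short natural-language description of the image.
    description = captioner(image_path)[0]["generated_text"]
    # Synthesize speech. English is used here because African-language TTS
    # support differs by engine; check coverage before relying on it.
    gTTS(description, lang="en").save(out_mp3)
    return description

print(describe_aloud("photo.jpg"))  # hypothetical input image
```

Real assistive apps add much more, including screen-reader integration, offline operation, and user control over verbosity, but the two-step pattern (describe, then speak) is the core idea.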


A Talking Health Chatbot in African Languages: DSFSI, University of Pretoria

A project at the Data Sciences for Social Impact (DSFSI) group, University of Pretoria, led by Professor Vukosi Marivate, is developing a talking health chatbot in African languages to provide accessible, culturally relevant health information to underserved communities. Central to the initiative is the planned use of health actuality TV and radio programmes produced by the South African Broadcasting Corporation (SABC) as training data, which introduces legal and ethical considerations around the use of publicly funded broadcast materials in AI. Until South Africa's pending copyright reforms are enacted, the project may be held up, pointing to the need for harmonised legal frameworks for AI for Good in Africa.

1. Health Inequality in South Africa and the Role of AI

South Africa experiences some of the world's highest health inequalities, shaped by historical, economic, and social factors. Urban centers are relatively well-resourced, but rural and peri-urban areas face critical shortages of health professionals and infrastructure. Language and literacy barriers further exacerbate disparities, with many unable to access health information in their mother tongue. Digital health interventions, especially those using artificial intelligence, offer a way to bridge these gaps by delivering accurate, on-demand health information in local languages. An AI-powered chatbot can empower users to make informed decisions, understand symptoms, and navigate the healthcare system, promoting greater equity in health outcomes.

2. What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a subfield of artificial intelligence that enables computers to understand, interpret, and generate human language. NLP powers applications such as chatbots, voice assistants, and automated translation tools, making it crucial for digital inclusion, especially for speakers of underrepresented languages. (A short worked example appears at the end of this case study.)

3. Project Overview

The DSFSI health chatbot project aims to build an AI-powered conversational agent that delivers reliable health information in multiple African languages. The project's mission is to address health literacy gaps and promote equitable access to vital information, particularly in communities where language and resource barriers persist.

4. Data Sources and Key Resources

A distinctive feature of the project is its intention to use health actuality programmes broadcast by the SABC as primary training data. These programmes offer authentic dialogues in various African languages and cover a wide range of health topics relevant to local communities. However, the use of SABC broadcast material introduces significant legal and ethical complexities. The DSFSI team has spent years negotiating with the SABC to secure permission for use of these programmes as training data, but obtaining a definitive answer has proven elusive, leaving the project in a state of legal uncertainty.

5. Legal and Ethical Challenges

Copyright and Licensing: SABC's health actuality programmes are protected by copyright, with all rights typically reserved by the broadcaster. Using these materials for AI training without explicit permission may constitute copyright infringement, regardless of educational or social impact goals.

Contractual Restrictions: Even if SABC content is publicly accessible, the broadcaster's terms of use or licensing agreements may explicitly prohibit reuse, redistribution, or data mining.

Absence of Research Exceptions: South African copyright law currently lacks robust exceptions for text and data mining (TDM) or research use, unlike the European Union's TDM exceptions or the United States' Fair Use doctrine.

Data Privacy and Community Engagement: If the chatbot is later trained on user interactions or collects personal health information, the project must also comply with the Protection of Personal Information Act (POPIA) and ensure meaningful informed consent from all participants.

6. Public Funding and the Public Interest Argument

A significant dimension in negotiations with the SABC is the broadcaster's funding structure. The SABC operates under a government charter and receives substantial public subsidies, with direct grants and bailouts accounting for about 27% of its 2022/2023 revenue. This strengthens the argument that SABC-produced content should be accessible for public interest projects, particularly those addressing urgent challenges like health inequality and language inclusion. Many in the research and innovation community contend that publicly funded content should be available for projects benefiting the broader public, especially those focused on health literacy and digital inclusion.

7. The WIPO Broadcasting Treaty: A New Layer of Complexity

The international copyright landscape is evolving, with the World Intellectual Property Organization (WIPO) currently negotiating a Broadcasting Treaty. Recent drafts propose granting broadcasters—including public entities like the SABC—new, additional exclusive rights over their broadcast content, independent of the underlying copyright. Some drafts suggest these new rights could override or negate existing copyright exceptions and limitations, including those that might otherwise permit uses for research, education, or public interest projects. If adopted in its current form, the WIPO Broadcasting Treaty could further restrict the ability of researchers and innovators to use broadcast material for AI training, even when the content is publicly funded or serves a vital social function.

8. The Copyright Amendment Bill: Introducing Fair Use in South Africa

A potentially transformative development is the Copyright Amendment Bill, which aims to introduce a Fair Use doctrine into South African law. Modeled after the U.S. system, Fair Use would allow limited use of copyrighted material without permission for research, teaching, and public interest innovation—the core activities of the DSFSI health chatbot initiative. If enacted, the Bill would provide a much-needed legal pathway for researchers to use materials like SABC broadcasts for AI training, provided the use is fair, non-commercial, and does not undermine the market for the original work. However, the Bill has faced significant opposition and delays, and is currently under review by the Constitutional Court, leaving its future uncertain.

9. Contractual or Policy Barriers

In the absence of clear research exceptions, the project team must review and potentially negotiate with the SABC to secure permissions or licenses for the intended use of broadcast content. Without such agreements, the project may be forced to exclude valuable data sources or pivot to community-generated content.

10. Cross-Border and Multi-Jurisdictional Issues

If the chatbot expands to use or serve content from other African countries, it will encounter a patchwork of copyright and data protection laws, further complicating compliance and cross-border collaboration.

11. Conclusions

The challenges faced by the DSFSI health chatbot project illustrate how legal uncertainty around publicly funded broadcast content can hold up socially beneficial AI. They point to the need for harmonised legal frameworks across the continent, including clear research and text and data mining exceptions, to enable AI for Good in Africa.
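As a concrete illustration of the NLP described in section 2, the sketch below uses the openly available NLLB-200 translation model to render a health message in isiZulu. The model choice and example sentence are assumptions for illustration; this is not the DSFSI chatbot, whose implementation is not described in this case study.

```python
# Minimal sketch: translating a health message into isiZulu with NLLB-200.
from transformers import pipeline

translator = pipeline("translation",
                      model="facebook/nllb-200-distilled-600M",
                      src_lang="eng_Latn",   # source: English
                      tgt_lang="zul_Latn")   # target: isiZulu

msg = "Drink clean water and wash your hands to prevent cholera."
print(translator(msg, max_length=100)[0]["translation_text"])
```

A production chatbot would wrap a component like this in dialogue management, medically validated content, and safeguards against mistranslation, all of which matter greatly for health information.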


Masakhane: Use of the JW300 Dataset for Natural Language Processing

The Masakhane Project showcases the transformative power of open, collaborative efforts in advancing natural language processing (NLP) for African languages. However, its reliance on the JW300 dataset—a vast multilingual corpus primarily comprising copyrighted biblical translations—uncovered significant legal and ethical challenges centred on copyright restrictions, contract overrides, and the complexities of cross-border data use. These challenges led to the discontinuation of JW300's use within Masakhane, prompting a shift toward community-generated data. The experience illustrates the urgent need for robust copyright exceptions, clear legal frameworks, and ethical data sourcing to foster innovation and inclusivity in global NLP research. A video version of the case study is available below.

1. What Is Natural Language Processing?

Natural Language Processing (NLP) is a branch of computer science and artificial intelligence focused on enabling computers to understand, interpret, and generate human language, both written and spoken. NLP integrates computational linguistics with machine learning, deep learning, and statistical modeling, allowing machines to recognize patterns, extract meaning, and respond to natural language inputs in ways that approximate human comprehension. NLP underpins many everyday technologies, including search engines, digital assistants, chatbots, voice-operated GPS systems, and automated translation services. NLP is crucial for breaking down language barriers and has become integral to the digital transformation of societies worldwide.

2. Masakhane Project Overview

The Masakhane Project is an open-source initiative dedicated to advancing NLP for African languages. Its mission is to democratize access to NLP tools by building a continent-wide research community and developing datasets and benchmarks tailored to Africa's linguistic diversity. By engaging researchers, linguists, and technologists across the continent, Masakhane ensures that African languages are not marginalized in the digital age. The project employs advanced sequence-to-sequence models, training them on parallel corpora to enable machine translation and other NLP tasks between African languages. The distributed network of contributors allows Masakhane to address the unique challenges of Africa's linguistic landscape, where many languages lack sufficient digital resources.

A notable achievement is the "Decolonise Science" project, which creates multilingual parallel corpora of African research by translating scientific papers from platforms like AfricArxiv into various African languages. This initiative enhances access to academic knowledge and promotes the use of African languages in scientific discourse, exemplifying Masakhane's commitment to African-centric knowledge production and community benefit.

3. JW300 Dataset and Its Role

The JW300 dataset was pivotal to Masakhane's early work. It offers around 100,000 parallel sentences per language pair across more than 300 languages, many of them African, mostly sourced from Jehovah's Witnesses' biblical translations. For many languages, JW300 is one of the only large-scale, aligned text sources available, making it invaluable for training baseline translation models such as English-to-Zulu or English-to-Yoruba. Masakhane utilized automated scripts for downloading and preprocessing JW300, including byte-pair encoding (BPE) to optimize model performance. (An illustrative preprocessing sketch appears at the end of this case study.) Community contributions further expanded the dataset's coverage, filling language gaps and improving resource quality. JW300's widespread use enabled rapid progress in building machine translation models for underrepresented African languages.

4. Copyright Infringement Discovery

Despite JW300's open availability on platforms like OPUS, its use was legally problematic. In 2023, a legal audit by the Centre for Intellectual Property and Information Technology Law (CIPIT) in Nairobi revealed that the Jehovah's Witnesses' website explicitly prohibited text and data mining in its copyright notice. This meant Masakhane's use of JW300 was unauthorized. When Masakhane's organizers formally requested permission to use the data, their request was denied. This highlighted a fundamental tension between Masakhane's open research ethos and the proprietary restrictions imposed by the dataset's owners, forcing the project to reconsider its data strategy.

5. Copyright Exceptions and Limitations: The Role of TDM Exceptions and Fair Use

Many jurisdictions provide copyright exceptions and limitations to balance creators' rights with the needs of researchers and innovators. The European Union's text and data mining (TDM) exceptions and the United States' Fair Use doctrine are prominent examples.

The EU's Directive on Copyright in the Digital Single Market (Directive (EU) 2019/790) introduced two mandatory TDM exceptions. The first allows research organizations and cultural heritage institutions to conduct TDM for scientific research, regardless of contractual provisions. The second permits anyone to perform TDM for any purpose, provided the rights holder has not expressly opted out. Recent German case law clarified that an explicit reservation in a website's terms is sufficient to exclude commercial TDM, but the exception remains robust for research contexts.

In the U.S., the Fair Use doctrine allows limited use of copyrighted material without permission for purposes like criticism, comment, teaching, scholarship, or research. Courts increasingly recognize that using copyrighted works to train AI models can qualify as Fair Use, especially when the use is transformative and does not harm the original work's market.

Had Masakhane operated in the EU or U.S., these exceptions might have provided a legal basis for using JW300 for non-commercial research. However, most African countries lack clear TDM provisions or Fair Use recognition, exposing researchers to greater legal uncertainty and risk. The Masakhane experience underscores the need for African nations to adopt or clarify copyright exceptions that support research and digital innovation.

6. Contract Overrides

Contract overrides occur when contractual terms—such as website terms of service—impose restrictions beyond those set by statutory copyright law. In JW300's case, Jehovah's Witnesses' website terms explicitly prohibit text and data mining, overriding any potential exceptions or fair use provisions. For Masakhane, this meant that even if their use could be justified under fair use or research exceptions in some jurisdictions, the contractual terms imposed stricter limitations. Only in jurisdictions where statutes prevent contracts from overriding copyright exceptions (such as the EU's TDM provision for research institutions) could these terms be challenged. This highlights the importance of reviewing all terms of service and data use agreements before using third-party datasets, especially in open, cross-border research projects.

7. Cross-Border Use

The cross-border nature of datasets like JW300 adds further legal complexity, especially for open research projects with contributors across multiple countries. Masakhane operates in a pan-African context, with team members and users in different nations. Copyright and data use laws vary widely: what is permissible under fair use or a research exception in one jurisdiction may constitute infringement in another, so the same dataset can expose contributors in different countries to very different levels of legal risk.
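Section 3 mentions automated preprocessing scripts, including byte-pair encoding. The sketch below shows what such a BPE step can look like using the SentencePiece library; the file names, language, and vocabulary size are hypothetical, and the example deliberately assumes an ethically sourced, properly licensed corpus rather than JW300.

```python
# Illustrative BPE preprocessing for a low-resource machine translation corpus.
import sentencepiece as spm

# Train a BPE subword model on one side of a (hypothetical) parallel corpus.
spm.SentencePieceTrainer.train(
    input="train.yo",        # hypothetical Yoruba text, one sentence per line
    model_prefix="bpe_yo",   # writes bpe_yo.model and bpe_yo.vocab
    vocab_size=4000,         # small vocabularies often suit low-resource data
    model_type="bpe",
)

# Segment sentences into subword units before training or querying an NMT model.
sp = spm.SentencePieceProcessor(model_file="bpe_yo.model")
print(sp.encode("Bawo ni o se wa?", out_type=str))
```

Subword segmentation like this lets a translation model handle rare and unseen words, which is especially important for languages with little digital text.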


The NOODL license: Licensing African datasets to support research and AI in the Global South

With the increasing prominence of AI in all sectors of our economy and society, access to training data has become an important topic for practitioners and policy makers. In the Global North, a small number of large corporations with deep pockets have gained a head start in AI development, using training data from all over the world. But what about the creators and the communities whose creative works and languages are being used to train AI models? Shouldn't they also derive some benefit? And what about AI developers in Africa and the Global South, who often struggle to gain access to training data?

In an effort to level the playing field and ensure that AI supports the public interest, legal experts and practitioners in the Global South are developing new tools and protocols to tackle these questions. One approach is to devise new licenses for datasets. In a pathbreaking initiative, lawyers at Strathmore University in Nairobi have teamed up with their counterparts at the University of Pretoria to develop the NOODL license. NOODL is a tiered license, building on Creative Commons, but with preferential terms for developers in Africa and the Global South. It also opens the door for recognition and a flow of benefits to creators and communities. NOODL was inspired by researchers using African language works to develop Natural Language Processing systems, for purposes such as translation and language preservation.

In this presentation, Dr Melissa Omino, the Head of the Centre for Intellectual Property and Information Technology Law (CIPIT) at Strathmore University in Nairobi, Kenya, talks about the NOODL license. The presentation was originally delivered at the Conference on Copyright and the Public Interest in Africa and the Global South, in Johannesburg in February 2025. The full video of the presentation is available here.

Licensing African Datasets to ensure support for Research and AI in the Global South
Dr Melissa Omino

Introduction

[Ben Cashdan]: We have Dr. Melissa Omino from CIPIT at Strathmore University in Nairobi to talk a little bit about a piece of work that they're doing to try and ensure that the doors are not closed, that there is some opportunity to go on doing AI, doing research in Africa, but not necessarily throwing the doors open to everybody to do everything with all our stuff. Tell us a little bit about that.

[Dr Melissa Omino]: Well, I really like that introduction. Yes, and that was the thinking behind it. Also, it's interesting that I'm sitting next to Vukosi [Marivate, Professor of Computer Science at the University of Pretoria] because Vukosi has a great influence on why the license exists. You've heard him talking about Masakhane and the language data that they needed. At the previous ReCreate conference, we talked about the JW300 dataset – I hope you all know about that. If you don't, this is a plug for the ReCreate YouTube channel, so that you can go and look at that story. That's a Masakhane story.

Background: The JW300 Dataset

To make sure that we're all together in the room, I'll give you a short synopsis of the JW300 dataset. Vukosi, you can jump in if I get something wrong. Essentially, Masakhane, as a group of African AI developers, were conducting text and data mining online for African languages so that they could build AI tools that solve African problems. We just had a wonderful example right now about the weather in Zulu, things like that. That's what they wanted to cater for and the solutions they wanted to create. They went ahead and found [that there are] very minimal datasets or data available online for African problem solving, basically in African languages. But they did find one useful resource on the Jehovah's Witness website, which had a lot of African languages because they had translated the Bible into different African languages. They were utilizing this in what was called the JW300 dataset. However, somehow, I don't know how, you guys thought about copyright. They thought about copyright after text and data mining. They thought, hey, can you actually use this dataset? That's how they approached it. The first thing we did was look at the website.

Copyright notices excluding text and data mining

Most websites have a copyright notice, and a copyright notice lets you know what you can and can't do with the copyright material that is presented on the website. The copyright notice on the Jehovah's Witness website specifically excluded text and data mining for the data that was there. We went back to Masakhane and said, sorry, you can't use all this great work that you've collected. You can't use it because it belongs to Jehovah's Witness, and Jehovah's Witness is an American company registered in Pennsylvania. They asked us, how is it that this is African languages from different parts of Africa, and the copyright belongs to an American company, and we cannot use the language? I said, well, that's how the law works. And so they abandoned the JW300 dataset. This created a new avenue of research, because Masakhane did not give up. They became innovative and decided to collect their own language datasets. And not only is Masakhane doing this; Kencorpus is also collecting its own language datasets.

Building a Corpus of African Language Data

But where do you get African languages from? People. You go to the people to collect the language, right? If you're lucky, you can find a text that has the language, but not all African languages will have the text. Your first source would be the communities that speak the African languages, right? And you need funding, because collecting language is expensive – Vukosi can confirm. He's collecting 3,000 hours of languages, and his budget to collect that is crazy. So you collect the language. You go to the community, record them however you want to do that. Copyright experts will tell you the minute you
