African languages go mainstream in AI with a huge new dataset
Artificial intelligence tools such as ChatGPT, DeepSeek, Siri, and Google Assistant are built in the global north and trained primarily on English, Chinese, and European languages. African languages, by contrast, are significantly underrepresented on the internet. For the past two years, a group of African computer scientists, linguists, language specialists, and others has been working to change this. The African Next Voices project, funded primarily by the Gates Foundation (with additional support from Meta) and involving a network of African universities and organizations, has just unveiled what is believed to be the largest dataset of African languages for AI to date. We asked the team about their project, which has sites in Kenya, Nigeria, and South Africa.

Language is how we interact, ask for help, and carry meaning within our communities. It is how we structure complex thoughts and convey ideas. It is also how we tell an AI what we want, and how we judge whether it has understood us. AI applications are spreading across education, health, agriculture, and other sectors. The models behind them, known as large language models (LLMs), are trained on vast amounts of mostly text data, yet they exist for only a small number of the world's languages.

Languages embody culture, values, and local knowledge. If AI does not speak our languages, it cannot accurately grasp our intentions, and we cannot trust or verify its responses. In short: without language, AI cannot communicate with us, and we cannot communicate with it. Building AI in our languages is therefore the only way for AI to serve people's needs. Limiting the languages that are modeled risks excluding a vast range of human cultures, histories, and knowledge.
The fortunes of a language are deeply tied to the histories of the people and communities who speak it. Many people who lived through colonialism and empire saw their own languages marginalized, never developed to the same degree as the languages of the colonizers. African languages are rarely documented, particularly online. There is not enough digitised text and speech to train and evaluate robust AI models, a scarcity that stems from decades of policy decisions favouring colonial languages in schools, media, and government.

Language data is only one of the missing pieces. Are there dictionaries, terminologies, and glossaries available? Basic tools are scarce, and many other factors drive up the cost of building datasets: African language keyboards, fonts, spell-checkers, tokenisers (which segment text into smaller units so a language model can process it), orthographic variation (differences in spelling across regions), tone marking, and a rich diversity of dialects.

The result is AI that performs poorly and is at times unsafe: inaccurate translations, poor transcription, and systems that fail to understand African languages. This effectively cuts many Africans off from global news, educational resources, healthcare information, and the productivity gains AI can offer, all in their native languages. When a language is absent from the data, its speakers are absent from the product, and the resulting AI cannot be safe, useful, or fair for them. They are left without the language technology tools that could improve service delivery. This excludes millions of people and widens the technology gap.
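To make the tokeniser problem concrete, here is a minimal sketch, not from the project itself, comparing how a byte-level tokeniser trained mostly on English text (GPT-2's, used here purely as a stand-in) splits an English word versus a tone-marked Yoruba word. The example word choice is ours, for illustration only:

```python
# A minimal sketch of why tokenisers matter for African languages.
# Requires: pip install transformers

from transformers import AutoTokenizer

# GPT-2's byte-level BPE tokeniser, trained overwhelmingly on English text,
# stands in here for tokenisers built without African languages in mind.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["education", "ẹ̀kọ́"]:  # "ẹ̀kọ́" is Yoruba for "education"
    tokens = tokenizer.tokenize(word)
    print(f"{word!r} -> {len(tokens)} tokens: {tokens}")

# Typical result: the English word maps to one or two tokens, while the
# diacritic-rich Yoruba word is shattered into many byte-level fragments.
# Fragmentation like this makes models slower, costlier, and less accurate
# for the languages affected.
```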
The primary aim is to gather speech data for automatic speech recognition (ASR), the technology that converts spoken language into written text. ASR is a crucial tool for languages that are predominantly spoken rather than written. More broadly, the project investigates how data for ASR should be collected and how much of it is needed to build effective ASR tools, and we aim to share what we learn across different geographic regions.

The data we gather is intentionally varied: spontaneous and read speech across multiple domains, including everyday conversation, healthcare, financial inclusion, and agriculture, from speakers of different ages, genders, and educational backgrounds. All recordings are collected with informed consent, fair compensation, and transparent data rights agreements. We follow language-specific guidelines and run a comprehensive set of technical checks during transcription.

In Kenya, we are gathering voice data for five languages through the Maseno Centre for Applied AI, documenting the three main language groups: Nilotic (Dholuo, Maasai, and Kalenjin), Cushitic (Somali), and Bantu (Kikuyu). In Nigeria, through Data Science Nigeria, we are gathering speech in five widely spoken languages: Bambara, Hausa, Igbo, Nigerian Pidgin, and Yoruba. The dataset aims to faithfully reflect how these communities actually use language. In South Africa, we have been recording seven languages through the Data Science for Social Impact lab and its collaborators, showcasing the country's rich linguistic diversity: isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele, and Tshivenda.
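For readers unfamiliar with ASR, the sketch below shows the basic shape of the task: audio in, text out. It is an illustration using an openly available multilingual model via the Hugging Face transformers library, not the project's own pipeline, and the audio file name is a placeholder:

```python
# A minimal sketch of what ASR does: transcribe a speech recording to text.
# Requires: pip install transformers torch (plus ffmpeg for audio decoding)

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # an open multilingual checkpoint; its thin
)                                  # African-language coverage is exactly the gap

# Hypothetical recording; datasets like African Next Voices supply the audio
# and human transcriptions needed to train and evaluate models on such input.
result = asr("recording_kikuyu.wav")
print(result["text"])  # the model's transcription hypothesis
```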
This work is closely connected to broader efforts. We build on the momentum and insights of the Masakhane Research Foundation network, Lelapa AI, Mozilla Common Voice, EqualyzAI, and many other organizations and individuals who have led the development of African language models, data, and tools. Each project reinforces the others, together creating a growing ecosystem that makes African languages more visible and usable in the era of AI.

The data and models will be useful for captioning local-language media, powering voice assistants in agriculture and health, and improving call-centre and support services in multiple languages. The data will also be preserved as cultural heritage. Larger, better-balanced African language datasets that are publicly accessible will let us link text and speech resources effectively. Models will serve not only as research testbeds but as working assets in chatbots, educational tools, and local service delivery. The opportunity is to move beyond datasets alone towards ecosystems of tools, such as spell-checkers, dictionaries, translation systems, and summarisation engines, that keep African languages thriving in digital spaces.
In summary, we are combining ethically sourced, high-quality speech at scale with models, so that people can speak naturally, be understood accurately, and use AI in the languages of their everyday lives.

This project gathered voice data only for specific languages. What about the rest? And what about other tools, such as machine translation or grammar checkers? We will keep working across more languages, ensuring that we build data and models that reflect how Africans actually use their languages. Our focus is on smaller language models that are energy efficient and accurate for the African context. The challenge now is integration: making these components work together so that African languages are not merely showcased in isolated demonstrations but used in real-world platforms.

One lesson from this project, and others like it, is that collecting data is only step one. The data must also be benchmarked, reusable, and connected to communities of practice. Our focus now is on ensuring that the ASR benchmarks we develop can plug into other ongoing initiatives across Africa. Sustainability matters too: students, researchers, and innovators need continued access to computing resources, training materials, and licensing frameworks (such as NOODL or Esethu).

The long-term vision is choice: a farmer, a teacher, or a local business should be able to use AI in isiZulu, Hausa, or Kikuyu rather than being limited to English or French. If we succeed, AI built in African languages will not merely be playing catch-up. It will be setting new benchmarks for inclusive, responsible AI worldwide.
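As an illustration of what "benchmarked" means in practice: ASR systems are usually scored by word error rate (WER), the fraction of words an automatic transcription gets wrong against a human reference. A minimal sketch, with invented placeholder sentences rather than project data:

```python
# A minimal sketch of WER-based ASR benchmarking.
# Requires: pip install jiwer

import jiwer

# Illustrative Swahili sentences (not project data): a human reference
# transcription versus a hypothetical ASR output with one wrong word.
reference = "mtoto anaenda shule kila siku"
hypothesis = "mtoto anaenda shuleni kila siku"

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # 1 substitution over 5 reference words -> 20.00%
```

Shared benchmarks of this kind are what let different initiatives across Africa compare models on the same footing.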