Building the digital tools for Indigenous languages to flourish online

There are around 6,000-7,000 languages in the world today, but digital representation of most of them is sparse.

(Science Photo Library via AFP/Daniel Buah)

“If people like us, the new generation, do not learn this language and do not write or speak this language, a lot of the things will be missed. Within a few generations the language will disappear entirely from the world,” cultural researcher Dr. Subhash Ram Prajapati tells Equal Times. Prajapati, originally from Thimi (a small town in Nepal’s Kathmandu Valley), says he is one of the few members of the Newar community to study his Indigenous language (Nepalbhasa, also known as Nepal Bhasa, Newar or Newari), earning a joint degree in Nepalbhasa and Nepalese culture while learning computer science.

Nepal has a population of approximately 30 million people, who speak 129 languages between them. Between 1952 and 2011 the percentage of Nepalbhasa speakers declined from 75 per cent to 22 per cent in the language’s home region of the Kathmandu Valley. Despite the Kathmandu Metropolitan City government mandating the teaching of Nepalbhasa from grades 1-8 in 2020, this is not a national policy, and many parents do not teach the language at home. “Whoever wants to learn this language these days, if they do not have the opportunity to learn at home, it is through the internet or through some books,” explains Prajapati. However, a lack of Nepalbhasa digitisation means that online accessibility is severely underdeveloped.

Since 2009, Prajapati has been addressing the issue, marrying his passion for Nepalbhasa and computer science to bring his community greater online access in their language.

He founded the non-profit web portal nepalmandal.com, which has grown to contain over 25,000 articles and audiovisual resources on a variety of subjects – language and culture, as well as sports, technology and fashion – in Nepalbhasa. “When the commercial online portal used to post five articles a day, we used to post 25 articles a day.” Today “it’s the biggest portal in Nepalbhasa online” and relies on the voluntary contributions of writers, photographers and editors to produce wide-ranging Nepalbhasa news content.

Prajapati’s interest in the topic began early. “As a child, I realised that there has not been enough studying of Newar music and I started collecting various papers and books.” Since then, the cultural researcher has harnessed similar vigour to establish the Nepalbhasa Learning Club, a global Facebook community of over 4,300 members who regularly share linguistic insights by posting questions, discussing grammar and circulating tutorials. Today this vital online space is helping to counter a decline in which a third of the 1.3 million-strong Newar population no longer speaks Nepalbhasa, and which has led UNESCO to label it a ‘definitely endangered’ language.

Prajapati says that Nepalbhasa’s decline began “after the political change in Nepal when the Gorkha came [in the 18th century] and conquered the Kathmandu Valley”. The Rana regime, in power from the 19th century until the mid-20th century, and the Panchayat system of 1960-1990 accelerated that decline by banning the language. Prior to the restoration of multiparty democracy in 1990, Nepal endured a policy of one language and one culture, Nepali, driving a process of linguistic and cultural suppression which impacted the rights of Indigenous communities like the Newar.

Once Nepal’s national language, Nepalbhasa provided many kinship words to Nepali, documented the community’s history, and preserved ancient knowledge like architecture and medicines. In January 2020, this rich legacy inspired Prajapati to embark on his largest revitalisation project, Nepalbhasa.org, an online and print dictionary with “over 30,000 words and meanings”. Fifteen volunteers are collaborating to fortify Nepalbhasa.org’s online footprint. Editors, technicians and software engineers are addressing challenges caused by Nepalbhasa’s different regional dialects which affect digitisation, as well as working on script convergence to make it fully inclusive. Last year, Prajapati also contributed towards a Google Translate project for the Nepalbhasa language. “It is still a work-in-progress but hopefully it will be out soon so that people can use Nepalbhasa and Google Translate.”

Digital underdevelopment

There are around 6,000-7,000 languages in the world today, but digital representation of most of them is sparse. For example, Google Translate offers communication in just 109 languages, while Microsoft Translator does so in 100. Research suggests that within the next century at least half of the world’s languages will become extinct, with over 85 per cent of alphabets already considered ‘endangered’, a process that is undoubtedly exacerbated by the unequal pace of digital development worldwide.

As the International Decade for Indigenous Languages (2022-2032) gets under way, Prajapati is calling for more resources “so that people can learn, use and break the language barrier in technologies”. Failure to provide them risks pushing speakers of Indigenous languages to abandon their native languages entirely in favour of more widely spoken ones. But the problem appears to be two-fold: on the one hand, big tech players seem to show little interest in most smaller languages, because there is little financial incentive to invest in them; on the other hand, the groups that want to see their native languages better represented online often lack the resources to make it happen.

This is something that Blessing Sibanda is all too familiar with. “I know at some point, some of the context or meaning is going to be lost, so I am not very comfortable using my language online,” says the software engineer, translator and natural language processing researcher from Zimbabwe (natural language processing, or NLP, gives machines a human-like ability to decipher text and the spoken word). Africa is home to around a third of the world’s languages, but languages like Shona, which Sibanda speaks and which is one of Zimbabwe’s 16 official languages, risk becoming obsolete online.

Despite its wealth of offline resources and its recognition as a lingua franca during the British colonial era, Shona, like most African languages, remains digitally underdeveloped, says Sibanda. This affects the ability of its estimated 10.7 million speakers to access reliable information online in their mother tongue.

Shona, like Nepalbhasa, is what is known as a ‘low-resource language’: there simply isn’t enough data being fed into language processing systems, and kept up to date, to improve the accuracy of machine translations (MTs) in these languages.

As a result, available tech tools often miss vital aspects of online translation: from incomplete or unreliable translations compared to dominant languages, to the overlooking of key words and the inability to provide direct translations for new words and phrases like ‘Covid-19’.

Sibanda has been grappling with these issues, and others, since joining the pan-African grassroots organisation Masakhane, which is attempting to redress the near total dearth of African languages in the technological space by driving “NLP research in African languages, for Africans, by Africans”. Masakhane, a name which translates to ‘we build together’ in isiZulu, hopes to help counter the legacy of centuries of colonialism in Africa, which has resulted in a “technological space that does not understand our names, our cultures, our places, our history”.

Sibanda further explains: “Masakhane started without any backing, just resourcefulness through the use of open-source and free tools to collaborate, conduct experiments and organise community events”. For the last three years Masakhane has been uniting developers and researchers under the foundation of long-term collaboration for “Africans to shape and own these technological advances towards human dignity, well-being and equity”.

For Sibanda, “as a native speaker, it’s good to take control of the technologies that are being created around your language because you better understand the nuances of the language and you can create better systems,” adding that software developers are not always native speakers of the languages that they work on, suggesting a potential emphasis on quantity over quality in emerging language technologies.

In 2019, after joining Masakhane’s community of 1,000 diverse participants from 30 African nations working on 67 African languages, Sibanda feared he might be “restricted to conducting experiments and writing code”. However, Masakhane’s approach of “inclusive community-building, open participatory research and multi-disciplinarity” allowed him to work on developing digital tools in Shona, something Sibanda says few others are doing.

“At the beginning I started working on machine translation,” explains Sibanda, who focused on training a Shona language model and “trying to find ways where you can train systems without a lot of global data”. Research suggests MT systems tend to experience difficulties when data is scarce, which was the case for Sibanda. “We trained translation models but to get the data, we had to use a Jehovah’s Witness dataset,” he explains, as the Jehovah’s Witness community undertook a pioneering, three-year project translating and publishing a revised edition of the New World Translation (the most common version of the Bible after the King James Version) in Shona in 2019. But it was not without its limitations: “It’s a very religious kind of dataset. It misses some content in terms of the natural kind of languages that we’re using,” says Sibanda.

“I was then involved in the human evaluation of the trained models on translating surveys,” he continues. This resulted in a published research paper, Participatory research for low-resourced machine translation: A case study in African languages. He then undertook crucial work in an area where African languages and research are underrepresented: the creation and coordination of annotated Shona datasets for named-entity recognition (NER).

NER is an information extraction task: it picks out the named items in unstructured text and sorts them into categories like dates, identities and locations. It plays a fundamental role in products like spell-checkers and the localisation of voice and dialogue systems, due to its reproducible results.
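To make the idea concrete, here is a minimal sketch in Python of what such a task produces. It is a toy illustration only, not Masakhane's method: the lookup table, the category names and the example sentence are all assumptions made for this sketch, whereas real NER systems learn from annotated datasets of the kind Sibanda helped to build.

```python
import re

# Hypothetical hand-written lookup table (an assumption for illustration).
# Real NER models learn such names from annotated text instead.
GAZETTEER = {
    "Blessing Sibanda": "PERSON",
    "Masakhane": "ORGANISATION",
    "Zimbabwe": "LOCATION",
}

# Very rough date detector: four-digit years from 1900 to 2099.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_entities(text):
    """Return (surface form, category, start offset) tuples in text order."""
    entities = []
    # Look up every known name in the text.
    for name, category in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            entities.append((match.group(), category, match.start()))
    # Tag anything that looks like a year as a date.
    for match in YEAR_PATTERN.finditer(text):
        entities.append((match.group(), "DATE", match.start()))
    # Sort by position so the output follows the sentence.
    return sorted(entities, key=lambda entity: entity[2])


sentence = "Blessing Sibanda joined Masakhane in 2019 and works on Shona in Zimbabwe."
for surface, category, start in extract_entities(sentence):
    print(f"{category:<12} {surface}")
```

A lookup table like this only finds the names it has been given, which is why annotated datasets matter: they let a model learn to recognise names it has never seen before.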

This foundational work is helping to strengthen African language digitisation and research after inspiring another research paper, MasakhaNER: Named Entity Recognition for African languages.

While big tech has arguably overlooked African languages, in October 2021 Google pledged to invest US$1 billion over the next five years to deliver faster and cheaper internet in Africa. For organisations like Masakhane, the news suggests that the tide could be turning in their favour, particularly after Google also promised African startups investment of US$50 million alongside access to its employees, network and technologies, as the company aims to drive the continent’s digital transformation.

In terms of achievements, Masakhane has successfully created a prototype translator for Shona and five other African languages: Yoruba and Igbo from Nigeria, Lingala and Tshiluba from the Democratic Republic of Congo, and Swahili, which is widely spoken in East Africa. Currently the organisation is focusing its efforts on data-gathering and transcription in order to lay the foundation for more advanced language technologies. Ultimately Masakhane’s vision is for machine translation and artificial intelligence to transform the digital footprint of African languages, as communities like DataScience Zimbabwe, Digital Umuganda, Deep Learning Indaba and Data Science Africa also make valuable contributions to the field. However, with over 50 native African languages already disappearing from the world, technologists are warning that if African languages are not included in algorithms, they risk obsolescence.