What do we do?
The goals of Project Anuvaad include:
Train deep learning models for Indic languages using the project's parallel corpora and other publicly available corpora. Our goal is to have high-quality Neural Machine Translation (NMT) models for all major Indian languages. As of May 2020, we have models for nine Indian languages.
Create parallel corpora that can be used to train NMT models, along with tools and utilities to help create such corpora. These corpora may be general or domain specific. It is the stated goal of the project to create the largest publicly available parallel corpora in Indic languages.
Develop interactive translation tools to help users obtain “final” translated output.
Develop and maintain open-source implementations of OCR tools for Indic scripts, used to pre-process documents in Indic languages.
Why do this?
90% of India's population does not speak English. The Eighth Schedule of the Indian Constitution lists 22 official languages, and the country has 6,000-plus dialects and 55-plus languages with more than 1 million speakers each. Translation is therefore an important national priority.
Machine Translation (and Natural Language Processing (NLP) in general) is a field that has made dramatic progress in the last few years. While the core technology is available as open source, there is no credible open source translation alternative for Indic languages. Project Anuvaad hopes to fill this gap and help us take control of our own languages.
One example of a domain where our technology can impact society is the judicial system. Reducing the time and effort needed to obtain high-quality translations to and from Indian languages can significantly reduce pendencies in the judicial system. Project Anuvaad has assisted the Honorable Supreme Court of India in the launch of SUVAS to help make progress in this matter.
We are leveraging AI, more specifically Neural Machine Translation techniques, to achieve our translation goals. We use and adapt open-source projects. We have created an end-to-end translation pipeline and toolchains to achieve state-of-the-art translation quality for the selected domain. We have made sure translators are at the center of the solution, which makes proofreading an integral part of it. The corrected data goes back into the training pipeline as well as into the parallel corpus.
Our data sources include Government of India and state government circulars, notifications, court judgments, orders, notices, press releases, and parliament proceedings. We have built various tools to extract single and parallel sentences from these documents.
Translation is an active area in which various technology powerhouses are also investing. Team Anuvaad is leveraging various open-source projects to build a ready-to-use, production-grade translation solution, specifically for Indic languages.
We are also releasing the datasets that we have collected and continue to collect from various sources. These datasets are cleaned, quality assured, and released under the MIT license. We hope they will be useful to evangelists as well as academicians.
We believe in openness and transparency. We invite language enthusiasts to experience what we have built.
Why not experience legal document translation firsthand?
While we prepare Anuvaad for its beta launch, we are currently signing up invite-only users. Please fill in a short form to express your interest in participating, and Team Anuvaad will definitely reach out to you!