Indian AI startup Sarvam AI has open-sourced its first Hindi language model, OpenHathi-Hi-0.1. The release is part of Sarvam AI’s broader initiative to contribute to the Indian language AI ecosystem by providing open models and datasets that foster innovation in the field.
According to the company’s blog post, OpenHathi-Hi-0.1 is built on Meta AI’s Llama 2-7B model and is comparable to GPT-3.5 for Indic languages. The development process addressed the challenge of tokenization, a crucial step in processing text with large language models. The problem is particularly acute for Hindi, where limited training text makes the process more costly than for English. The team worked in two phases to make tokenization more cost-effective.
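To give a rough sense of why Hindi text can be costlier to tokenize, the sketch below compares how many UTF-8 bytes each character takes in an English and a Hindi sentence. This is only a proxy and not Sarvam AI's method: it assumes a tokenizer that falls back to raw bytes for characters missing from its vocabulary, and the helper name and sample sentences are made up for illustration.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per character in `text`."""
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample sentences, chosen only for illustration.
english = "India is a large country"
hindi = "भारत एक विशाल देश है"

# ASCII letters take one byte each, while Devanagari characters take three.
# Under the byte-fallback assumption, the same idea therefore costs more
# pieces in Hindi than in English.
print(f"English: {utf8_bytes_per_char(english):.2f} bytes per character")
print(f"Hindi:   {utf8_bytes_per_char(hindi):.2f} bytes per character")
```

A tokenizer whose vocabulary includes Hindi subwords would narrow this gap, which is the kind of saving the article describes as making tokenization more cost-effective.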
The model was tested on various benchmarks, including standard tasks such as translation and newer ones such as toxicity classification and text classification. The base model is now available on the Hugging Face platform, where developers can fine-tune it for specific use cases.
Sarvam AI’s co-founders, Pratyush Kumar and Vivek Raghavan, previously collaborated with AI4Bharat, and the startup has partnered with AI4Bharat to use its language resources and benchmarks for training OpenHathi.
After securing $41 million in Series A funding led by Lightspeed Ventures, with participation from Peak XV and Khosla Ventures, the five-month-old startup is focused on building large language models that use voice as a common interface to meet the diverse demands of the Indian market. Sarvam AI is also working on a range of enterprise-grade models on its full-stack generative AI platform, set to be released soon.