India is in a race to create its own AI voice, one that understands and reflects the country's diverse linguistic landscape. This push for "Indic AI" is driven by the understanding that AI models trained primarily on Western, English-heavy data cannot adequately meet India's unique needs. Translation alone isn't sufficient; the goal is to develop AI that can "think" in Hindi, Tamil, Marathi, and other Indian languages to capture nuance and reduce bias.
The Imperative for Indic-First AI
The need for AI models tailored to Indian languages stems from several factors. India is a multilingual nation with 22 officially recognized languages and hundreds of dialects. While global Large Language Models (LLMs) have made significant progress in natural language processing, they often struggle to understand and generate content in Indian languages. Indic LLMs aim to bridge this gap with models rooted in Indian languages, cultures, and knowledge systems.
A focus on Indic languages also supports the development of voice-first AI models, which are seen as a key differentiator for India. Abhishek Singh, additional secretary at the Ministry of Electronics and Information Technology (MeitY) and CEO of the India AI Mission, has said that voice will be the primary way people interact with AI in the future.
The Data Bottleneck
The biggest challenge in building Indic AI models is the scarcity of high-quality language data. Only a small fraction of the world's open datasets are in Indian languages. Even public archives, such as those of the state broadcaster Doordarshan, can be difficult to access. Startups are employing various strategies to overcome this data bottleneck.
Soket Labs founder Abishek Upperwal emphasizes the importance of curating high-quality data, stating that they have accumulated 8 trillion tokens in Indic languages but need at least 20 trillion.
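To put those two figures side by side, the arithmetic can be sketched in a few lines of Python. This is only an illustration using the numbers quoted above; the function and variable names are hypothetical and do not come from Soket Labs.

```python
# Figures quoted in the article; names here are illustrative assumptions.
TOKENS_COLLECTED = 8 * 10**12   # 8 trillion Indic-language tokens accumulated
TOKENS_TARGET = 20 * 10**12     # stated minimum of 20 trillion tokens


def corpus_gap(collected: int, target: int) -> tuple[int, float]:
    """Return the remaining token shortfall and the fraction of the target reached."""
    shortfall = max(target - collected, 0)
    progress = collected / target
    return shortfall, progress


shortfall, progress = corpus_gap(TOKENS_COLLECTED, TOKENS_TARGET)
print(f"Shortfall: {shortfall / 10**12:.0f} trillion tokens")
print(f"Progress toward target: {progress:.0%}")
```

On these numbers, the corpus is 40% of the way to the stated minimum, with 12 trillion tokens still to be collected.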
Government Initiatives and Funding
The Indian government is strongly supporting the development of Indic AI through the India AI Mission, which has a budget of roughly ₹10,000 crore. The mission offers startups funding, subsidized access to GPUs, and incentives for building indigenous AI models. Startups including Sarvam AI, Soket Labs, and Gnani.ai have already been approved to build foundational AI models under the mission.
Through its Bhashini program, MeitY maintains language datasets covering 22 Indian languages, which the AI Mission is making available to help startups build LLMs.
Key Players and Models
Several organizations and startups are actively developing Indic LLMs.
Challenges and the Path Forward
Despite the progress, challenges remain. Some experts emphasize the need for national language standardization to ensure that AI can be used effectively in vernacular languages. Indian models also need to catch up with the quality of models from companies like OpenAI.
To overcome these hurdles, companies are focusing on specific problems rather than replicating the path taken by large-model developers like OpenAI. By focusing on niche applications and leveraging unique datasets, India can build powerful and relevant AI solutions for its diverse population. IBM and BharatGen, for example, are collaborating to accelerate AI adoption in India using Indic LLMs, with a focus on underserved languages and dialects.