Morning Brief Podcast: India's Pursuit of Large Indic Datasets to Fuel AI and Language Technologies

Sep 26, 2025

India is in a race to create its own AI voice, one that understands and reflects the country's diverse linguistic landscape. This push for "Indic AI" is driven by the understanding that AI models trained primarily on Western, English-heavy data cannot adequately meet India's unique needs. Translation alone isn't sufficient; the goal is to develop AI that can "think" in Hindi, Tamil, Marathi, and other Indian languages to capture nuance and reduce bias.

The Imperative for Indic-First AI

The need for AI models tailored to Indian languages stems from several factors. India is a multilingual nation with 22 officially recognized languages and hundreds of dialects. While global Large Language Models (LLMs) have made significant progress in natural language processing, they often struggle to understand and generate content in Indian languages. Indic LLMs aim to bridge this gap with models deeply rooted in Indian languages, cultures, and knowledge systems.

Moreover, a focus on Indic languages allows for the development of voice-first AI models, which are seen as a key differentiator for India. As Abhishek Singh, additional secretary at MeitY and CEO of the India AI Mission, noted, voice will be the primary way people interact with AI in the future.

The Data Bottleneck

The biggest challenge in building Indic AI models is the scarcity of high-quality language data. Only a small fraction of the world's open datasets are in Indian languages. Even public archives, such as those of the public broadcaster Doordarshan, can be difficult to access. Startups are employing various strategies to overcome this data bottleneck, including:

Crowdsourcing Voices: Gathering voice data from various sources.
Licensing Content: Partnering with publishing houses to license their content.
Generating Synthetic Text: Creating artificial text data to supplement existing datasets.
Negotiating with Ministries: Working with government bodies to access their data.
Partnering with Linguistic Experts: Collaborating with linguists to curate high-quality data.
Using Client Data: Obtaining permission to use client data for training models.

Soket Labs founder Abhishek Upperwal emphasizes the importance of curating high-quality data, stating that the company has accumulated 8 trillion tokens in Indic languages but needs at least 20 trillion.

Government Initiatives and Funding

The Indian government is strongly supporting the development of Indic AI through the India AI Mission, with a budget of nearly ₹10,000 crore. The mission provides funding, subsidized access to GPUs, and incentives for building indigenous AI models. Startups like Sarvam AI, Soket Labs, and Gnani.ai have already been approved to build foundational AI models under the mission.

MeitY, through its Bhashini program, maintains datasets covering 22 Indian languages, which the India AI Mission is offering to startups to help them build LLMs.

Key Players and Models

Several organizations and startups are actively developing Indic LLMs:

AI4Bharat: A research lab at IIT Madras, AI4Bharat is dedicated to advancing AI technology for Indian languages through open-source contributions. It has developed models like IndicBERT, IndicBART, and IndicTrans2.
Sarvam AI: This startup focuses on building LLMs specifically designed for Indian languages and use cases. They have launched open-source foundational models supporting 10 Indian languages.
CoRover.ai: This company has launched BharatGPT, an indigenous LLM that supports voice in more than 12 Indian languages and text in 22 languages.
Soket Labs: This startup is working on building Indic language models and navigating the data scarcity challenge.
Gnani.ai: This company is also part of the India AI Mission, working with linguistic organizations to gather content across multiple languages.
IIIT Hyderabad: IIIT Hyderabad is driving BharatGen's sovereign AI mission with Indic Vision-Language Models like Patram and eVikrAI.
Tech Mahindra: This company has unveiled Project Indus, an AI model to understand Hindi and its dialects.

Challenges and the Path Forward

Despite the progress, challenges remain. Some experts emphasize the need for national language standardization to ensure the effective use of AI in vernacular languages. Indic models also need to catch up with the quality of models from companies like OpenAI.

To overcome these hurdles, companies are focusing on specific problems to solve rather than replicating the path of large-model developers like OpenAI. By focusing on niche applications and leveraging unique datasets, India can build powerful and relevant AI solutions for its diverse population. IBM and BharatGen, for example, are collaborating to accelerate AI adoption in India using Indic LLMs, with a focus on underserved languages and dialects.

Written By

Madhav Verma

Madhav Verma is a Bollywood journalist with a strong command over film trends, industry insights, and audience preferences. His writing blends critique, culture, and commentary, giving readers a 360° view of India’s entertainment world. Madhav’s clarity and credibility make him a trusted voice in film media. He’s passionate about decoding what makes cinema timeless.
