• By Anurag Mishra
  • Mon, 04 Aug 2025 06:56 PM (IST)
  • Source: JND

BharatGen is India's indigenous generative AI model, built with India's linguistic diversity, cultural context and digital self-reliance in mind. It is a collaborative effort involving premier institutions like IIT Bombay, IIT Madras, IIIT Hyderabad, IIT Kanpur, IIM Indore, and IIT Mandi, and is funded by the Department of Science and Technology. Last month, at the BharatGen Summit, Union Minister Dr Jitendra Singh launched 'BharatGen', India's first indigenously developed, government-funded, AI-based multimodal Large Language Model (LLM) for Indian languages. Professor Ganesh Ramakrishnan of IIT Bombay heads BharatGen. Anurag Mishra spoke with him about BharatGen and AI innovation in India.

What was the inspiration behind the development of BharatGen, and how does it fit into India's larger vision of self-reliance in AI?

The inspiration for BharatGen is linked to advancing India's digital freedom, Indian languages, and cultural thought. Its aim is to create an AI system that is equitable and accessible to all. BharatGen wants to bring generative AI within the reach of common people, especially in the context of Indian languages and India's social fabric.

How is BharatGen different from global LLMs like GPT or Gemini, especially in terms of Indian languages, cultural context and multimodal capabilities?

BharatGen is a collaborative effort that includes premier institutions like IIT Bombay, IIT Madras, IIIT Hyderabad, IIT Kanpur, IIM Indore, and IIT Mandi. It is funded by the Department of Science and Technology through IIT Bombay.

BharatGen prioritises a model architecture built around local linguistic diversity, data localisation, Devanagari and other Indian scripts, cultural idioms and Indian contexts. It is also releasing India-centric benchmarks on platforms such as 'AIKosh', from the IndiaAI Mission, and Hugging Face, and it is tailored to ground-level Indian use cases.

What technical and linguistic challenges were faced while developing BharatGen, especially concerning low-resource Indian languages and dialects?

- Lack of sufficient and diverse multilingual data.

- Tokenisation challenges for Indian-language scripts. Tokenisation means breaking a sentence or text into small parts (tokens) so that a machine can process them. It is a significant challenge for Indian languages because they differ widely in structure, script and grammar.

- Bias control in low-resource dialects.

- Indian sources like script, voice, text and video are still underrepresented in international datasets.
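One concrete reason tokenisation is harder for Indian scripts can be sketched in a few lines of Python. This is an illustration only, not BharatGen's tokeniser: the helper `utf8_stats` and the sample words are assumptions chosen for the example. It shows that each Devanagari character takes three bytes in UTF-8, so a tokeniser that falls back to raw bytes spends far more tokens on Hindi text than on English text of similar length.

```python
def utf8_stats(text: str) -> tuple[int, int]:
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


# Sample words are illustrative; any Latin vs. Devanagari pair would do.
samples = {
    "English": "hello",
    "Hindi (Devanagari)": "नमस्ते",
}

for label, word in samples.items():
    points, size = utf8_stats(word)
    # ASCII letters are 1 byte each; Devanagari code points are 3 bytes each.
    print(f"{label}: {points} code points, {size} UTF-8 bytes")
```

A tokeniser trained mostly on English text tends to learn few merged units for such scripts, which is why the vocabulary has to be built with Indian-language data in mind.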

What datasets were used to train BharatGen, and how was it ensured that they reflect India's linguistic-cultural diversity?

The experts who have worked on the development of BharatGen were already associated with the government's language project named 'Bhashini'. Because of this, data created by Bhashini—that is, previously collected text and audio in various Indian languages—was used in building BharatGen.

However, they did not rely solely on old data. The BharatGen team also prepared its own new collection (corpus) of text and audio, specifically in several Indian languages. The objective was to ensure that India's linguistic diversity could be well-represented.

None of this data was used without scrutiny. A special process was created to thoroughly check the data's quality, accuracy, and usage permissions, to ensure that what the model was being taught was correct, balanced, and trustworthy.

How does BharatGen adhere to the principles of responsible and inclusive AI? What measures does it take to prevent misinformation, bias and misuse?

From the very beginning, BharatGen took care to ensure that all Indian languages are properly represented so that there would be no need for separate tuning later. For this, the data was selected carefully so that every language and region was fairly represented, meaning no language or community was left behind.

To measure and reduce bias in the model, special benchmarks based on Indian languages were used. Additionally, to measure the model's language understanding capability, scores like 'fertility' and 'perplexity' were utilised. These scores were used to check how fair and efficient the tokeniser (which breaks text into tokens the model can process) is across more than 22 Indian languages.
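For readers unfamiliar with these two scores, the Python sketch below uses their common definitions: fertility as the average number of tokens a tokeniser produces per word, and perplexity as the exponential of the average negative log-probability a model assigns to each token. It is a minimal illustration under stated assumptions, not BharatGen's evaluation code. The function names, the character-level `char_tokenize` stand-in and the toy inputs are all hypothetical.

```python
import math


def fertility(sentences, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    words = sum(len(s.split()) for s in sentences)
    tokens = sum(len(tokenize(s)) for s in sentences)
    return tokens / words


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def char_tokenize(sentence):
    """Hypothetical stand-in tokeniser: one token per non-space character."""
    return [ch for ch in sentence if not ch.isspace()]


# Toy inputs, invented for illustration only.
print(fertility(["ab cd"], char_tokenize))   # 4 tokens over 2 words
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # a uniform guess among 4 options
```

Lower fertility means a language is split into fewer pieces per word. Comparing the score across languages is one way to see whether a tokeniser treats them evenly.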

India has so far been dependent on international AI tools, so why was it necessary to create indigenous LLMs, and what will be their effect on digital sovereignty?

BharatGen's biggest strength is that it frees India from its dependence on foreign AI models. This gives the country customised control in specific sectors where there is no internet connection, such as air-gapped systems related to security or defence. This initiative will make India not just a consumer of technology, but also a creator of AI-related intellectual property. BharatGen is a strong link in the AI vision of a self-reliant India (Atmanirbhar Bharat).

What are the practical applications envisioned for BharatGen in sectors like education, agriculture, health, governance, and citizen services?

AI Tutor: Digital assistants that teach students in their native language.

Voice-based agricultural assistance: Providing answers to farmers' queries through voice in their own language.

Multilingual chatbots for government services: Enabling people to get information about government schemes and services in their own language.

Clinical decision support system: Helping doctors understand patient reports and decide on treatment.

Content creation in local languages: Generative AI tools that create content in local languages for education and administration.

From a global competitiveness perspective, where does India stand in AI today, and what needs to be done next for global leadership?

India is gaining strength in the field of AI with efforts like the IndiaAI Mission and BharatGen. However, for global leadership, companies will have to increase financial and technical investment and support innovation.

How important is the partnership between the government and academic institutions in large projects like BharatGen? Specifically, what has been the role of IIT Bombay?

IIT Bombay is the nodal centre for the BharatGen project. It has collaborated with six of the country's premier institutions to lead important tasks such as designing the model, gathering the necessary data, providing training, and evaluating the model. This shows how large academic institutions can collectively advance a technological project.

What policy or regulatory framework should India have for responsible AI development that also promotes innovation?

To develop AI responsibly, there should be some essential policies, such as: adequate compute access for research and training, transparency of data, checking for bias in local languages, ensuring accountability where AI is being used, and promoting AI that is compatible with Indian languages and societal needs.

What is the message for young Indian researchers, students, and innovators who are inspired by BharatGen and want to build the next generation of ethical AI systems?

AI is not just technology; it is a perspective. The youth should deeply understand their own fields and think about how AI can be helpful within them. They should incorporate inclusivity, originality, and ethics into their thinking. BharatGen invites the youth to come together to shape the future of AI with Indian values and a global mindset.


(This article was translated for The Daily Jagran by Akansha Pandey.)