🚀 Ganga-2-1b Model
Ganga-2-1b is an instruct-tuned model trained on a monolingual Hindi dataset as part of Project Unity, an initiative that aims to address India's linguistic diversity and to achieve state-of-the-art performance in understanding and generating text in Indian languages.
🚀 Quick Start
Use the following code to get started with the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-2-1b")
model = AutoModelForCausalLM.from_pretrained("LingoIITGN/ganga-2-1b", device_map="auto")

input_text = 'Translate it into Hindi "Innovation is the key to solving complex problems in the modern world."'

# The model expects prompts wrapped in its <bos><user> ... <assistant> markers.
input_ids = tokenizer.encode("<bos><user>" + input_text + "<assistant>",
                             return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0]))
```
✨ Features
- Project Unity Initiative: Aims to address India's linguistic diversity by creating models for major Indian languages.
- High Performance: The Ganga-2-1b model outperforms existing open-source models supporting Indian languages, even those with up to 7 billion parameters.
- High-Quality Dataset: Trained on a large, high-quality dataset of Hindi text, curated by native speakers.
📦 Installation
Install the `transformers` library with `pip install transformers` (PyTorch is also required), then load the model and tokenizer as shown in the Quick Start section.
📚 Documentation
Model Description
Project Unity is an initiative to address India's linguistic diversity and richness by training models on the monolingual regional languages of India. The first release, the Ganga-1b model, was trained on a large dataset of public-domain, web-crawled Hindi text. The Ganga-2-1b model outperforms existing open-source models for Indian languages.
Technical Specifications
- Precision: BFloat16
- Context Length: 2,048 tokens
- Learning Rate: 4e-4
- Optimizer: AdamW
- LR Scheduler: Cosine
Model Architecture and Objective
Ganga-2-1b is a decoder-only transformer model with the following specifications (see the config-inspection sketch after this list):
- Layers: 16
- Attention heads: 32
- Embedding dimension: 2,048
- Vocabulary size: 32,768
- Sliding window: 1,024 tokens
- Intermediate dimension: 7,168
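As a sanity check, these values can be read off the released configuration. A minimal sketch, assuming the checkpoint exposes standard Llama/Mistral-style config attribute names (the `sliding_window` field in particular is an assumption):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("LingoIITGN/ganga-2-1b")

print(config.num_hidden_layers)    # layers, expected 16
print(config.num_attention_heads)  # attention heads, expected 32
print(config.hidden_size)          # embedding dimension, expected 2048
print(config.vocab_size)           # vocabulary size, expected 32768
print(config.intermediate_size)    # intermediate dimension, expected 7168
print(config.sliding_window)       # sliding window, expected 1024 (assumed attribute)
```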
🔧 Technical Details
The model is trained on a large-scale Hindi dataset that includes news articles, web documents, books, government publications, educational materials, and social media conversations. The dataset was curated by native speakers to ensure high quality. The architecture is a decoder-only transformer, which is well suited to text generation tasks.
📄 License
The model is licensed under the Apache 2.0 license.
💻 Usage Examples
The examples below are truncated Hindi prompts of the kind the model is designed to complete (English glosses in italics); a completion sketch follows the list.
Example 1
BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज
*(English: "Amid the T20 World Cup, the BCCI … the Zimbabwe series …")*
Example 2
7 अक्टूबर को हमास से जंग शुरू होने के सात महीने बाद इजरायली सेना
*(English: "Seven months after the war with Hamas began on October 7, the Israeli army …")*
Example 3
हवा में अवांछित गैसों की उपस्थिति से मनुष्य, पशुओं तथा पक्षियों को
*(English: "Due to the presence of unwanted gases in the air, humans, animals, and birds …")*
Example 4
पहले संदिग्ध मामलों को 31 दिसंबर 2019 को WHO को सूचित किया गया था,
*(English: "The first suspected cases were reported to the WHO on 31 December 2019,")*
Example 5
13 समन्वित बम विस्फोटों के बाद से मुंबई में कई गैर-राज्य हमले
*(English: "Since the 13 coordinated bomb blasts, several non-state attacks in Mumbai …")*
Example 6
निकोला टेस्ला का जन्म 10 जुलाई 1856 को स्किमडज़, क्रोएरिया में हुआ था,
*(English: "Nikola Tesla was born on 10 July 1856 in Skimdz, Croeria [sic],")*
Example 7
2007 टूर्नामेंट में क्रिकट विश्व कप के लिए टिकटों से सबसे ज्यादा आमदनी हुई
*(English: "In the 2007 tournament, the highest income came from tickets for the Cricket World Cup")*
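A minimal sketch of running one of these prompts as a plain completion, reusing the `model` and `tokenizer` from the Quick Start section; the sampling settings here are illustrative assumptions, not the settings used to produce the examples:

```python
# Plain completion-style generation (no <user>/<assistant> chat markers).
prompt = "हवा में अवांछित गैसों की उपस्थिति से मनुष्य, पशुओं तथा पक्षियों को"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=50, do_sample=True,
                         temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```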
Evaluation
Results
Tokenizer Results
Fertility is the average number of tokens a tokenizer produces per word; lower values indicate more efficient tokenization of Hindi text.
| Model | Fertility |
|---|---|
| Ganga-2-1b | 1.12 |
| Pragna-1b | 1.58 |
| Bloom-1b1 | 1.27 |
| Bloom-1b7 | 1.27 |
| Gemma-2b | 1.89 |
| Bloom-3b | 1.27 |
| Airavata-7b | 1.69 |
| Sarvam-2b | 1.38 |
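As an illustration, fertility can be estimated as tokens per word. A minimal sketch, assuming whitespace word segmentation (the reported numbers may use a different segmenter and evaluation corpus):

```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of tokens per whitespace-separated word (lower is better)."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-2-1b")
sample = ["हवा में अवांछित गैसों की उपस्थिति से मनुष्य, पशुओं तथा पक्षियों को"]
print(f"Fertility: {fertility(tokenizer, sample):.2f}")
```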
Metrics
| Model | PPL on Sangraha dataset (lower is better) |
|---|---|
| Ganga-2-1b | 8.09 |
| Ganga-1b | 15.82 |
| Pragna-1b | 9.37 |
| Bloom-1b1 | 17.49 |
| Bloom-1b7 | 14.28 |
| Gemma-2b | 31.01 |
| Bloom-3b | 12.82 |
| OpenHathi-7B | 25.73 |
| Airavata-7b | 38.24 |
| Sarvam-2b | 10.31 |
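For reference, a minimal sketch of computing perplexity on a single text sample with `transformers`; the reported numbers were measured on the Sangraha dataset, and the exact evaluation protocol (e.g., context stride) is not specified here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("LingoIITGN/ganga-2-1b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-2-1b")

text = "पहले संदिग्ध मामलों को 31 दिसंबर 2019 को WHO को सूचित किया गया था,"
enc = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy loss.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```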
Bias, Risks, and Limitations
Recommendations
⚠️ Important Note
This model is a research preview under continuous iterative development. It has limited safety measures and may generate offensive content. Using it for any illegal, harmful, violent, racist, or sexually explicit purpose is strictly prohibited.
Model Card Contact
Lingo Research Group at IIT Gandhinagar, India
Email: lingo@iitgn.ac.in