🚀 Ganga-2-1b Model
Ganga-2-1b is an instruct-tuned model trained on a monolingual Hindi dataset as part of Project Unity, an initiative that aims to address India's linguistic diversity and to achieve state-of-the-art performance in understanding and generating text in Indian languages.
🚀 Quick Start
Use the following code to get started with the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-2-1b")
model = AutoModelForCausalLM.from_pretrained("LingoIITGN/ganga-2-1b", device_map="auto")

input_text = 'Translate it into Hindi "Innovation is the key to solving complex problems in the modern world."'

# The model expects prompts wrapped in its <bos><user> ... <assistant> markers.
input_ids = tokenizer.encode("<bos><user>" + input_text + "<assistant>",
                             return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0]))
```
✨ Features
- Project Unity Initiative: Aims to address India's linguistic diversity by creating models for major Indian languages.
- High Performance: The Ganga-2-1b model outperforms existing open-source models supporting Indian languages, even those with up to 7 billion parameters.
- High-Quality Dataset: Trained on a large, high-quality dataset of Hindi text, curated by native speakers.
📦 Installation
Install the `transformers` library with `pip install transformers` (PyTorch is also required), then load the model and tokenizer as shown in the Quick Start section.
📚 Documentation
Model Description
Project Unity is an initiative to address India's linguistic diversity and richness by training models on the monolingual regional languages of India. The first release, the Ganga-1b model, was trained on a large dataset of public-domain, web-crawled Hindi text. The Ganga-2-1b model outperforms existing open-source models for Indian languages.
Technical Specifications
- Precision: BFloat16
- Context Length: 2,048 tokens
- Learning Rate: 4e-4
- Optimizer: AdamW
- LR Scheduler: Cosine
Model Architecture and Objective
Ganga-2-1b is a decoder-only transformer model with the following specifications (see the config-inspection sketch after this list):
- Layers: 16
- Attention heads: 32
- Embedding dimension: 2,048
- Vocabulary size: 32,768
- Sliding window: 1,024 tokens
- Intermediate dimension: 7,168
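As a sanity check, these values can be read off the released configuration. A minimal sketch, assuming the checkpoint exposes standard Llama/Mistral-style config attribute names (the `sliding_window` field in particular is an assumption):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("LingoIITGN/ganga-2-1b")

print(config.num_hidden_layers)    # layers, expected 16
print(config.num_attention_heads)  # attention heads, expected 32
print(config.hidden_size)          # embedding dimension, expected 2048
print(config.vocab_size)           # vocabulary size, expected 32768
print(config.intermediate_size)    # intermediate dimension, expected 7168
print(config.sliding_window)       # sliding window, expected 1024 (assumed attribute)
```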
🔧 Technical Details
The model is trained on a large-scale Hindi dataset that includes news articles, web documents, books, government publications, educational materials, and social media conversations. The dataset was curated by native speakers to ensure high quality. The architecture is a decoder-only transformer, which is well suited to text generation tasks.
📄 License
The model is licensed under the Apache 2.0 license.
💻 Usage Examples
The examples below are truncated Hindi prompts of the kind the model is designed to complete (English glosses in italics); a completion sketch follows the list.
Example 1
BCCI ने टी-20 वर्ल्ड कप के बीच जिम्बाब्वे सीरीज
*(English: "Amid the T20 World Cup, the BCCI … the Zimbabwe series …")*
Example 2
7 अक्टूबर को हमास से जंग शुरू होने के सात महीने बाद इजरायली सेना
*(English: "Seven months after the war with Hamas began on October 7, the Israeli army …")*
Example 3
हवा में अवांछित गैसों की उपस्थिति से मनुष्य, पशुओं तथा पक्षियों को
*(English: "Due to the presence of unwanted gases in the air, humans, animals, and birds …")*
Example 4
पहले संदिग्ध मामलों को 31 दिसंबर 2019 को WHO को सूचित किया गया था,
*(English: "The first suspected cases were reported to the WHO on 31 December 2019,")*
Example 5
13 समन्वित बम विस्फोटों के बाद से मुंबई में कई गैर-राज्य हमले
*(English: "Since the 13 coordinated bomb blasts, several non-state attacks in Mumbai …")*
Example 6
निकोला टेस्ला का जन्म 10 जुलाई 1856 को स्किमडज़, क्रोएरिया में हुआ था,
*(English: "Nikola Tesla was born on 10 July 1856 in Skimdz, Croeria [sic],")*
Example 7
2007 टूर्नामेंट में क्रिकट विश्व कप के लिए टिकटों से सबसे ज्यादा आमदनी हुई
*(English: "In the 2007 tournament, the highest income came from tickets for the Cricket World Cup")*
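A minimal sketch of running one of these prompts as a plain completion, reusing the `model` and `tokenizer` from the Quick Start section; the sampling settings here are illustrative assumptions, not the settings used to produce the examples:

```python
# Plain completion-style generation (no <user>/<assistant> chat markers).
prompt = "हवा में अवांछित गैसों की उपस्थिति से मनुष्य, पशुओं तथा पक्षियों को"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(input_ids, max_new_tokens=50, do_sample=True,
                         temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```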
Evaluation
Results
Tokenizer Results
Fertility is the average number of tokens a tokenizer produces per word; lower values indicate more efficient tokenization of Hindi text.
| Model | Fertility |
|---|---|
| Ganga-2-1b | 1.12 |
| Pragna-1b | 1.58 |
| Bloom-1b1 | 1.27 |
| Bloom-1b7 | 1.27 |
| Gemma-2b | 1.89 |
| Bloom-3b | 1.27 |
| Airavata-7b | 1.69 |
| Sarvam-2b | 1.38 |
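As an illustration, fertility can be estimated as tokens per word. A minimal sketch, assuming whitespace word segmentation (the reported numbers may use a different segmenter and evaluation corpus):

```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of tokens per whitespace-separated word (lower is better)."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-2-1b")
sample = ["हवा में अवांछित गैसों की उपस्थिति से मनुष्य, पशुओं तथा पक्षियों को"]
print(f"Fertility: {fertility(tokenizer, sample):.2f}")
```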
Metrics
| Model | PPL on Sangraha dataset (lower is better) |
|---|---|
| Ganga-2-1b | 8.09 |
| Ganga-1b | 15.82 |
| Pragna-1b | 9.37 |
| Bloom-1b1 | 17.49 |
| Bloom-1b7 | 14.28 |
| Gemma-2b | 31.01 |
| Bloom-3b | 12.82 |
| OpenHathi-7B | 25.73 |
| Airavata-7b | 38.24 |
| Sarvam-2b | 10.31 |
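For reference, a minimal sketch of computing perplexity on a single text sample with `transformers`; the reported numbers were measured on the Sangraha dataset, and the exact evaluation protocol (e.g., context stride) is not specified here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("LingoIITGN/ganga-2-1b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("LingoIITGN/ganga-2-1b")

text = "पहले संदिग्ध मामलों को 31 दिसंबर 2019 को WHO को सूचित किया गया था,"
enc = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy loss.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```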
Bias, Risks, and Limitations
Recommendations
⚠️ Important Note
This model is a research preview under continuous iterative development. It has limited safety measures and may generate offensive content. Using it for any illegal, harmful, violent, racist, or sexually explicit purpose is strictly prohibited.
Model Card Contact
Lingo Research Group at IIT Gandhinagar, India
Email: lingo@iitgn.ac.in