# Project Indus LLM

Project Indus LLM is a groundbreaking open-source language model tailored for Hindi and its dialects. It aims to enhance natural language processing and generation across diverse Indian linguistic applications.
## Quick Start

To quickly get started with Project Indus LLM, you can refer to the official documentation at <https://www.techmahindra.com/en-in/innovation/the-indus-project/>.
## Features

- Language-Specific Focus: Tailored for Hindi and its 37 dialects, addressing the linguistic diversity of India.
- Open-Source: Facilitates easy integration and further development by researchers and developers.
- Versatile Use Cases: Applicable in various industries such as call centers, healthcare, automotive, and telecom.
## Installation
No specific installation steps are provided in the original document.
## Usage Examples
No code examples are provided in the original document.
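Since the original model card ships no code, here is a minimal hypothetical sketch of loading the model through the Hugging Face `transformers` API. The repository id and the instruct-style prompt template are placeholders, not confirmed by the source; substitute the model's actual Hugging Face repository id and documented prompt format.

```python
# Hypothetical usage sketch. Requires: pip install transformers torch
# The repo id below is a placeholder, not the real identifier.

def build_prompt(question):
    """Wrap a question in a simple instruct-style template (illustrative only)."""
    return f"### Instruction:\n{question}\n\n### Response:\n"

def generate(question, model_id="<huggingface-repo-id>", max_new_tokens=100):
    # Imported lazily so build_prompt stays usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

For example, `generate("भारत की राजधानी क्या है?")` would download the model weights on first use and return the decoded completion.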
## Documentation

### Model Details

#### Model Description

Project Indus LLM is an open-source foundational model hosted on Hugging Face. It is a pretrained, instruct-tuned model for Hindi and its dialects.
- Developed by: Nikhil Malhotra, Nilesh Brahme, Satish Mishra, Vinay Sharma (Makers Lab, TechMahindra)
- Model type: Foundational Language model
- Language(s) (NLP): hin, bho, mai, doi
- License: other
- Parent Model: Built on the GPT-2 architecture, from tokenizer to decoder
- Resources for more information: <https://www.techmahindra.com/en-in/innovation/the-indus-project/>
### Uses

#### Direct Use
Project Indus can be directly used for generating text, simulating conversation, and other text generation tasks without additional training.
#### Downstream Use

It can be used for question answering and conversation in Hindi and its dialects. After reward tuning, it can be applied across industries such as call centers, healthcare, automotive, and telecom.

#### Out-of-Scope Use

Project Indus is not designed for high-stakes decision-making tasks such as medical diagnosis or legal advice. It currently does not support fill-in-the-blank exercises, multiple-choice question answering, or similar structured tasks.
### Bias, Risks, and Limitations

Significant research has explored bias and fairness issues in language models, and predictions generated by this model may include disturbing or harmful stereotypes. Although efforts were made to remove biases from the training data, the model, like any generative model, may also hallucinate, producing fluent but incorrect text. Any harmful stereotypes it produces are unintentional.

#### Recommendations

Users should review model outputs for bias and negative connotations before relying on them. Regular updates and community feedback are crucial for addressing emergent bias or misuse scenarios.
### Training Details

#### Infrastructure

- Training Infrastructure: Utilized high-performance computing resources provided by CDAC, featuring NVIDIA A100 GPUs.
- Running Infrastructure: Tested for both GPU (NVIDIA GeForce RTX 3070 or higher) and CPU (Intel Xeon Platinum 8580) environments.
#### Training Data

Project Indus LLM was trained on a diverse dataset of Hindi text and its dialects.

- Data Sources and Collection:
  - Open-Source Hindi Data: Collected from news portals, Wikipedia, commoncrawl.org, and 'Mann Ki Baat' broadcasts from All India Radio (AIR).
  - Translated Data: A portion of the Pile dataset was translated into Hindi using IndicTrans2 (AI4Bharat).
  - Dialects: Data for major dialects such as Maithili, Bhojpuri, Magahi, and Braj Bhasha was collected from multiple sources, including fieldwork.

#### Training Procedure

- Pre-training: Conducted on a dataset of 22 billion tokens using advanced tokenization techniques.
- Fine-Tuning: Supervised fine-tuning was performed with a focus on Indian languages, using custom datasets covering cultural, political, and social contexts.
| Phase | Data Source | Tokens | Notes |
| --- | --- | --- | --- |
| Pre-training | Cleaned dataset of Hindi and dialects | 22 billion | Utilized advanced tokenization |
| Fine-tuning | Custom datasets tailored for Indian languages | Varied | Focus on cultural, political, and social contexts |
#### Preprocessing
- Cleaning: Removed unwanted text, characters, and personal information. Performed transliteration and removed unwanted tags.
- Bias Removal: Used a Bias Removal Toolkit to detect and remove biased language.
- Tokenization: Used a custom tokenizer based on Byte Pair Encoding (BPE) with byte fallback for Hindi and its dialects.
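To make the byte-fallback idea above concrete, here is an illustrative sketch (not the actual Indus tokenizer, and deliberately simplified to character-level lookup rather than full BPE merges): characters found in the vocabulary map to their ids, and anything unseen is encoded as raw UTF-8 bytes in a reserved id range, so no input ever degrades to an unknown token.

```python
# Illustrative byte-fallback sketch — a toy, not the real Indus BPE tokenizer.
def encode_with_byte_fallback(text, vocab, byte_base):
    """Map known characters to vocab ids; encode unknowns as UTF-8 byte ids."""
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            # Fallback: one id per UTF-8 byte, offset into a reserved id range.
            ids.extend(byte_base + b for b in ch.encode("utf-8"))
    return ids

# Toy vocabulary covering two Devanagari characters; byte ids start at 2.
toy_vocab = {"न": 0, "म": 1}
ids = encode_with_byte_fallback("नमq", toy_vocab, byte_base=2)  # → [0, 1, 115]
```

Because the fallback path covers all 256 byte values, this scheme guarantees every string is encodable, which matters for a model spanning Hindi and many dialect scripts.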
### Evaluation

#### Testing Data, Factors & Metrics

- Testing Data: Various benchmark datasets were used, including the AI2 Reasoning Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8k.
- Factors: Different few-shot configurations (e.g., 25-shot, 10-shot, 5-shot, 0-shot) were used depending on the dataset.
- Metrics: Metrics such as acc_norm, acc, and mc2 were used to evaluate the model.
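As a rough illustration of one metric above, acc_norm-style scoring in common evaluation harnesses selects the answer choice whose log-likelihood, normalized by the choice's length, is highest. The helper below is a hypothetical sketch of that decision rule, not the official evaluation code:

```python
# Illustrative acc_norm-style decision rule (not the official harness):
# pick the answer choice with the best length-normalized log-likelihood.
def pick_choice_acc_norm(log_likelihoods, lengths):
    """Return the index of the best choice under length-normalized likelihood."""
    scores = [ll / max(n, 1) for ll, n in zip(log_likelihoods, lengths)]
    return max(range(len(scores)), key=scores.__getitem__)

# A longer choice with a worse raw likelihood can still win after normalization:
# -12.0 over 12 tokens scores -1.0, beating -9.0 over 3 tokens (-3.0).
best = pick_choice_acc_norm([-12.0, -9.0], [12, 3])  # → 0
```

Normalization of this kind prevents the metric from systematically favoring short answer choices, which is why it is reported alongside plain accuracy.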
#### Results

No specific results are provided in the original document.
### Model Examination
No specific details are provided in the original document.
### Technical Specifications

#### Model Architecture and Objective

Built on the GPT-2 architecture, with the objective of providing a robust language model for Indian languages.

#### Compute Infrastructure
- Hardware: Training used NVIDIA A100 GPUs; running was tested on NVIDIA GeForce RTX 3070 or higher GPUs and Intel Xeon Platinum 8580 CPUs.
- Software: Not specified in the original document.
### Citation
No citation details are provided in the original document.
### Glossary
No glossary is provided in the original document.
### More Information

For more information, visit <https://www.techmahindra.com/en-in/innovation/the-indus-project/>.
### Model Card Authors
Nikhil Malhotra, Nilesh Brahme, Satish Mishra, Vinay Sharma (Makers Lab, TechMahindra)
### Model Card Contact
No contact details are provided in the original document.
## How to Get Started with the Model

Refer to the official documentation at <https://www.techmahindra.com/en-in/innovation/the-indus-project/>.
## License

The model is licensed under the OSL-3.0 (Open Software License 3.0) license.