# Project Indus LLM

Project Indus LLM is a groundbreaking open-source language model tailored for Hindi and its dialects. It aims to enhance natural language processing and generation across diverse Indian linguistic applications.
## Quick Start

To quickly get started with Project Indus LLM, you can refer to the official documentation at <https://www.techmahindra.com/en-in/innovation/the-indus-project/>.
## Features

- Language-Specific Focus: Tailored for Hindi and its 37 dialects, addressing the linguistic diversity of India.
- Open-Source: Facilitates easy integration and further development by researchers and developers.
- Versatile Use Cases: Applicable in various industries such as call centers, healthcare, automotive, and telecom.
## Installation
No specific installation steps are provided in the original document.
## Usage Examples
No code examples are provided in the original document.
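Since the original model card ships no code, here is a minimal hypothetical sketch of loading the model through the Hugging Face `transformers` API. The repository id and the instruct-style prompt template are placeholders, not confirmed by the source; substitute the model's actual Hugging Face repository id and documented prompt format.

```python
# Hypothetical usage sketch. Requires: pip install transformers torch
# The repo id below is a placeholder, not the real identifier.

def build_prompt(question):
    """Wrap a question in a simple instruct-style template (illustrative only)."""
    return f"### Instruction:\n{question}\n\n### Response:\n"

def generate(question, model_id="<huggingface-repo-id>", max_new_tokens=100):
    # Imported lazily so build_prompt stays usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

For example, `generate("भारत की राजधानी क्या है?")` would download the model weights on first use and return the decoded completion.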
## Documentation

### Model Details

#### Model Description

Project Indus LLM is an open-source foundational model hosted on Hugging Face. It is a pretrained, instruct-tuned model for Hindi and its dialects.
- Developed by: Nikhil Malhotra, Nilesh Brahme, Satish Mishra, Vinay Sharma (Makers Lab, TechMahindra)
- Model type: Foundational Language model
- Language(s) (NLP): hin, bho, mai, doi
- License: other
- Parent Model: Built on the GPT-2 architecture, from tokenizer to decoder
- Resources for more information: <https://www.techmahindra.com/en-in/innovation/the-indus-project/>
### Uses

#### Direct Use
Project Indus can be directly used for generating text, simulating conversation, and other text generation tasks without additional training.
#### Downstream Use

It can be used for question answering and conversation in Hindi and its dialects. After reward tuning, it can be applied across industries such as call centers, healthcare, automotive, and telecom.

#### Out-of-Scope Use

Project Indus is not designed for high-stakes decision-making tasks such as medical diagnosis or legal advice. It currently does not support fill-in-the-blank exercises, multiple-choice question answering, or similar structured tasks.
### Bias, Risks, and Limitations

Significant research has explored bias and fairness issues in language models, and predictions generated by this model may include disturbing or harmful stereotypes. Although efforts were made to remove biases from the training data, the model, like any generative model, may also hallucinate, producing fluent but incorrect text. Any harmful stereotypes it produces are unintentional.

#### Recommendations

Users should review model outputs for bias and negative connotations before relying on them. Regular updates and community feedback are crucial for addressing emergent bias or misuse scenarios.
### Training Details

#### Infrastructure

- Training Infrastructure: Utilized high-performance computing resources provided by CDAC, featuring NVIDIA A100 GPUs.
- Running Infrastructure: Tested for both GPU (NVIDIA GeForce RTX 3070 or higher) and CPU (Intel Xeon Platinum 8580) environments.
#### Training Data

Project Indus LLM was trained on a diverse dataset of Hindi text and its dialects.

- Data Sources and Collection:
  - Open-Source Hindi Data: Collected from news portals, Wikipedia, commoncrawl.org, and 'Mann Ki Baat' broadcasts from All India Radio (AIR).
  - Translated Data: A portion of the Pile dataset was translated into Hindi using IndicTrans2 (AI4Bharat).
  - Dialects: Data for major dialects such as Maithili, Bhojpuri, Magahi, and Braj Bhasha was collected from multiple sources, including fieldwork.

#### Training Procedure

- Pre-training: Conducted on a dataset of 22 billion tokens using advanced tokenization techniques.
- Fine-Tuning: Supervised fine-tuning was performed with a focus on Indian languages, using custom datasets covering cultural, political, and social contexts.
| Phase | Data Source | Tokens | Notes |
| --- | --- | --- | --- |
| Pre-training | Cleaned dataset of Hindi and dialects | 22 billion | Utilized advanced tokenization |
| Fine-tuning | Custom datasets tailored for Indian languages | Varied | Focus on cultural, political, and social contexts |
#### Preprocessing
- Cleaning: Removed unwanted text, characters, and personal information. Performed transliteration and removed unwanted tags.
- Bias Removal: Used a Bias Removal Toolkit to detect and remove biased language.
- Tokenization: Used a custom tokenizer based on Byte Pair Encoding (BPE) with byte fallback for Hindi and its dialects.
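To make the byte-fallback idea above concrete, here is an illustrative sketch (not the actual Indus tokenizer, and deliberately simplified to character-level lookup rather than full BPE merges): characters found in the vocabulary map to their ids, and anything unseen is encoded as raw UTF-8 bytes in a reserved id range, so no input ever degrades to an unknown token.

```python
# Illustrative byte-fallback sketch — a toy, not the real Indus BPE tokenizer.
def encode_with_byte_fallback(text, vocab, byte_base):
    """Map known characters to vocab ids; encode unknowns as UTF-8 byte ids."""
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            # Fallback: one id per UTF-8 byte, offset into a reserved id range.
            ids.extend(byte_base + b for b in ch.encode("utf-8"))
    return ids

# Toy vocabulary covering two Devanagari characters; byte ids start at 2.
toy_vocab = {"न": 0, "म": 1}
ids = encode_with_byte_fallback("नमq", toy_vocab, byte_base=2)  # → [0, 1, 115]
```

Because the fallback path covers all 256 byte values, this scheme guarantees every string is encodable, which matters for a model spanning Hindi and many dialect scripts.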
### Evaluation

#### Testing Data, Factors & Metrics

- Testing Data: Various benchmark datasets were used, including the AI2 Reasoning Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8k.
- Factors: Different few-shot configurations (e.g., 25-shot, 10-shot, 5-shot, 0-shot) were used depending on the dataset.
- Metrics: Metrics such as acc_norm, acc, and mc2 were used to evaluate the model.
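As a rough illustration of one metric above, acc_norm-style scoring in common evaluation harnesses selects the answer choice whose log-likelihood, normalized by the choice's length, is highest. The helper below is a hypothetical sketch of that decision rule, not the official evaluation code:

```python
# Illustrative acc_norm-style decision rule (not the official harness):
# pick the answer choice with the best length-normalized log-likelihood.
def pick_choice_acc_norm(log_likelihoods, lengths):
    """Return the index of the best choice under length-normalized likelihood."""
    scores = [ll / max(n, 1) for ll, n in zip(log_likelihoods, lengths)]
    return max(range(len(scores)), key=scores.__getitem__)

# A longer choice with a worse raw likelihood can still win after normalization:
# -12.0 over 12 tokens scores -1.0, beating -9.0 over 3 tokens (-3.0).
best = pick_choice_acc_norm([-12.0, -9.0], [12, 3])  # → 0
```

Normalization of this kind prevents the metric from systematically favoring short answer choices, which is why it is reported alongside plain accuracy.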
#### Results

No specific results are provided in the original document.
### Model Examination
No specific details are provided in the original document.
### Technical Specifications

#### Model Architecture and Objective

Built on the GPT-2 architecture, with the objective of providing a robust language model for Indian languages.

#### Compute Infrastructure
- Hardware: Training used NVIDIA A100 GPUs; running was tested on NVIDIA GeForce RTX 3070 or higher GPUs and Intel Xeon Platinum 8580 CPUs.
- Software: Not specified in the original document.
### Citation
No citation details are provided in the original document.
### Glossary
No glossary is provided in the original document.
### More Information

For more information, visit <https://www.techmahindra.com/en-in/innovation/the-indus-project/>.
### Model Card Authors
Nikhil Malhotra, Nilesh Brahme, Satish Mishra, Vinay Sharma (Makers Lab, TechMahindra)
### Model Card Contact
No contact details are provided in the original document.
## How to Get Started with the Model

Refer to the official documentation at <https://www.techmahindra.com/en-in/innovation/the-indus-project/>.
## License

The model is licensed under the OSL-3.0 (Open Software License 3.0) license.