🚀 Airavata
Airavata is a 7B model obtained by fine-tuning OpenHathi on the IndicInstruct dataset, a collection of instruction datasets. It aims to provide high-quality text generation, especially in multilingual and instruction-tuned scenarios, as presented in the technical report.
🚀 Quick Start
Prerequisites
Clone https://github.com/AI4Bharat/IndicInstruct and install the required dependencies. Then download or clone this model to the same machine.
Input Format
The model is trained to use a chat format similar to the [open-instruct code repository](https://github.com/allenai/open-instruct) (note the newlines):
```
<|user|>
Your message here!
<|assistant|>
```
For best results, format all inputs in this manner. Make sure to include a newline after `<|assistant|>`; this can affect generation quality quite a bit.
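For a single user turn, the prompt string can be assembled like this (a minimal sketch; the `create_prompt_with_chat_format` helper in the Usage Examples section below does the same for multi-turn conversations):

```python
# Minimal sketch of the expected prompt layout.
# The trailing newline after <|assistant|> matters for generation quality.
user_message = "Your message here!"
prompt = f"<|user|>\n{user_message}\n<|assistant|>\n"
```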
✨ Features
- Multilingual Support: Works with both English and Hindi.
- Instruction-Tuned: Fine-tuned on a diverse set of instruction datasets for better performance on instruction-following and text generation tasks.
📦 Installation
Clone https://github.com/AI4Bharat/IndicInstruct and install the required dependencies. Then download or clone this model to the same machine.
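Alternatively, the model weights can be downloaded programmatically with `huggingface_hub` (a sketch; the `local_dir` path is an arbitrary choice, not part of the official instructions):

```python
# Sketch: download the Airavata checkpoint to a local folder.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ai4bharat/Airavata", local_dir="Airavata")
```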
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"


def create_prompt_with_chat_format(messages, bos="<s>", eos="</s>", add_bos=True):
    """Format a list of chat messages into the <|system|>/<|user|>/<|assistant|> template."""
    formatted_text = ""
    for message in messages:
        if message["role"] == "system":
            formatted_text += "<|system|>\n" + message["content"] + "\n"
        elif message["role"] == "user":
            formatted_text += "<|user|>\n" + message["content"] + "\n"
        elif message["role"] == "assistant":
            formatted_text += "<|assistant|>\n" + message["content"].strip() + eos + "\n"
        else:
            raise ValueError(
                "Tulu chat template only supports 'system', 'user' and 'assistant' roles. Invalid role: {}.".format(
                    message["role"]
                )
            )
    # The trailing newline after <|assistant|> is important for generation quality.
    formatted_text += "<|assistant|>\n"
    formatted_text = bos + formatted_text if add_bos else formatted_text
    return formatted_text


def inference(input_prompts, model, tokenizer):
    input_prompts = [
        create_prompt_with_chat_format([{"role": "user", "content": input_prompt}], add_bos=False)
        for input_prompt in input_prompts
    ]

    encodings = tokenizer(input_prompts, padding=True, return_tensors="pt")
    encodings = encodings.to(device)

    with torch.inference_mode():
        outputs = model.generate(
            encodings.input_ids,
            attention_mask=encodings.attention_mask,
            do_sample=False,
            max_new_tokens=250,
        )

    output_texts = tokenizer.batch_decode(outputs.detach(), skip_special_tokens=True)

    # Strip the prompt from each decoded output so only the generated answer remains.
    input_prompts = [
        tokenizer.decode(tokenizer.encode(input_prompt), skip_special_tokens=True) for input_prompt in input_prompts
    ]
    output_texts = [output_text[len(input_prompt) :] for input_prompt, output_text in zip(input_prompts, output_texts)]
    return output_texts


model_name = "ai4bharat/Airavata"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)

# Hindi example prompts: "How can I improve my time management skills? Give me five points."
# (the second prompt additionally asks the model to describe each point).
input_prompts = [
    "मैं अपने समय प्रबंधन कौशल को कैसे सुधार सकता हूँ? मुझे पांच बिंदु बताएं।",
    "मैं अपने समय प्रबंधन कौशल को कैसे सुधार सकता हूँ? मुझे पांच बिंदु बताएं और उनका वर्णन करें।",
]
outputs = inference(input_prompts, model, tokenizer)
print(outputs)
```
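The example above uses greedy decoding. For more varied outputs, a sampled-decoding variant is sketched below; the `temperature` and `top_p` values are illustrative choices, not settings from the technical report:

```python
# Sketch: sampled decoding for a single prompt, reusing the helpers defined above.
prompt = create_prompt_with_chat_format(
    [{"role": "user", "content": input_prompts[0]}], add_bos=False
)
enc = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    generated = model.generate(
        enc.input_ids,
        attention_mask=enc.attention_mask,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        max_new_tokens=250,
    )
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```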
📚 Documentation
Model Details
This model is a fine-tuned version of the [OpenHathi](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base) model on the [IndicInstruct dataset](https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1). It was trained as part of the technical report Airavata: Introducing Hindi Instruction-tuned LLM. The codebase used for training and evaluation can be found at https://github.com/AI4Bharat/IndicInstruct.
Hyperparameters
We fine-tune the OpenHathi base model on the aforementioned IndicInstruct dataset with LoRA. The hyperparameters for the LoRA fine-tuning are listed below, and a configuration sketch follows the list:
- LoRA Rank: 16
- LoRA alpha: 32
- LoRA Dropout: 0.05
- LoRA Target Modules: ["q_proj", "v_proj", "k_proj", "down_proj", "gate_proj", "up_proj"]
- Epochs: 4
- Learning rate: 5e-4
- Batch Size: 128
- Floating Point Precision: bfloat16
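For readers who want to mirror this setup with the `peft` library, the listed hyperparameters map onto a `LoraConfig` roughly as follows (a sketch only; the actual training script lives in the IndicInstruct repository):

```python
# Sketch of the LoRA setup described above, expressed as a peft LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj", "gate_proj", "up_proj"],
    task_type="CAUSAL_LM",
)
```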
We recommend that readers check out our official blog post for more details on model training, ablations, and evaluation results.
🔧 Technical Details
Model Architecture
The model is based on the OpenHathi architecture and is fine-tuned using LoRA on the IndicInstruct dataset. This approach allows for efficient fine-tuning and better performance in instruction-following tasks.
Training Data
The model is trained on the [IndicInstruct dataset](https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1), which is a collection of instruction datasets including Anudesh, wikiHow, Flan v2, Dolly, Anthropic-HHH, OpenAssistant v1, and LMSYS-Chat.
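To see which instruction collections the dataset repository bundles, one option is to list its files from the Hub (a sketch using `huggingface_hub`; loading individual subsets with `datasets.load_dataset` may additionally require the subset names given on the dataset card):

```python
# Sketch: inspect the IndicInstruct dataset repository on the Hugging Face Hub.
from huggingface_hub import list_repo_files

files = list_repo_files("ai4bharat/indic-instruct-data-v0.1", repo_type="dataset")
print(files)
```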
📄 License
This model is licensed under the Llama 2 license.
📊 Evaluation Results
Open LLM Leaderboard Evaluation Results
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_ai4bharat__Airavata).
| Metric | Value |
|---|---|
| Avg. | 45.52 |
| AI2 Reasoning Challenge (25-Shot) | 46.50 |
| HellaSwag (10-Shot) | 69.26 |
| MMLU (5-Shot) | 43.90 |
| TruthfulQA (0-shot) | 40.62 |
| Winogrande (5-shot) | 68.82 |
| GSM8k (5-shot) | 4.02 |
📖 Citation
```bibtex
@article{gala2024airavata,
  title   = {Airavata: Introducing Hindi Instruction-tuned LLM},
  author  = {Jay Gala and Thanmay Jayakumar and Jaavid Aktar Husain and Aswanth Kumar M and Mohammed Safi Ur Rahman Khan and Diptesh Kanojia and Ratish Puduppully and Mitesh M. Khapra and Raj Dabre and Rudra Murthy and Anoop Kunchukuttan},
  year    = {2024},
  journal = {arXiv preprint arXiv:2401.15006}
}
```

