🚀 Airavata
Airavata is a 7B model obtained by fine-tuning OpenHathi on the IndicInstruct dataset, a collection of instruction datasets. It aims to provide high-quality text generation, especially in multilingual and instruction-tuned scenarios, as presented in the technical report.
🚀 Quick Start
Prerequisites
Clone https://github.com/AI4Bharat/IndicInstruct and install the required dependencies. Then download or clone this model to the same machine.
Input Format
The model is trained to use a chat format similar to the [open-instruct code repository](https://github.com/allenai/open-instruct) (note the newlines):
```
<|user|>
Your message here!
<|assistant|>
```
For best results, format all inputs in this manner. Make sure to include a newline after `<|assistant|>`; this can affect generation quality quite a bit.
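For a single user turn, the prompt string can be assembled like this (a minimal sketch; the `create_prompt_with_chat_format` helper in the Usage Examples section below does the same for multi-turn conversations):

```python
# Minimal sketch of the expected prompt layout.
# The trailing newline after <|assistant|> matters for generation quality.
user_message = "Your message here!"
prompt = f"<|user|>\n{user_message}\n<|assistant|>\n"
```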
✨ Features
- Multilingual Support: Works with both English and Hindi.
- Instruction-Tuned: Fine-tuned on a diverse set of instruction datasets for better performance on instruction-following and text generation tasks.
📦 Installation
Clone https://github.com/AI4Bharat/IndicInstruct and install the required dependencies. Then download or clone this model to the same machine.
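Alternatively, the model weights can be downloaded programmatically with `huggingface_hub` (a sketch; the `local_dir` path is an arbitrary choice, not part of the official instructions):

```python
# Sketch: download the Airavata checkpoint to a local folder.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ai4bharat/Airavata", local_dir="Airavata")
```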
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"


def create_prompt_with_chat_format(messages, bos="<s>", eos="</s>", add_bos=True):
    """Format a list of chat messages into the <|system|>/<|user|>/<|assistant|> template."""
    formatted_text = ""
    for message in messages:
        if message["role"] == "system":
            formatted_text += "<|system|>\n" + message["content"] + "\n"
        elif message["role"] == "user":
            formatted_text += "<|user|>\n" + message["content"] + "\n"
        elif message["role"] == "assistant":
            formatted_text += "<|assistant|>\n" + message["content"].strip() + eos + "\n"
        else:
            raise ValueError(
                "Tulu chat template only supports 'system', 'user' and 'assistant' roles. Invalid role: {}.".format(
                    message["role"]
                )
            )
    # The trailing newline after <|assistant|> is important for generation quality.
    formatted_text += "<|assistant|>\n"
    formatted_text = bos + formatted_text if add_bos else formatted_text
    return formatted_text


def inference(input_prompts, model, tokenizer):
    input_prompts = [
        create_prompt_with_chat_format([{"role": "user", "content": input_prompt}], add_bos=False)
        for input_prompt in input_prompts
    ]

    encodings = tokenizer(input_prompts, padding=True, return_tensors="pt")
    encodings = encodings.to(device)

    with torch.inference_mode():
        outputs = model.generate(
            encodings.input_ids,
            attention_mask=encodings.attention_mask,
            do_sample=False,
            max_new_tokens=250,
        )

    output_texts = tokenizer.batch_decode(outputs.detach(), skip_special_tokens=True)

    # Strip the prompt from each decoded output so only the generated answer remains.
    input_prompts = [
        tokenizer.decode(tokenizer.encode(input_prompt), skip_special_tokens=True) for input_prompt in input_prompts
    ]
    output_texts = [output_text[len(input_prompt) :] for input_prompt, output_text in zip(input_prompts, output_texts)]
    return output_texts


model_name = "ai4bharat/Airavata"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)

# Hindi example prompts: "How can I improve my time management skills? Give me five points."
# (the second prompt additionally asks the model to describe each point).
input_prompts = [
    "मैं अपने समय प्रबंधन कौशल को कैसे सुधार सकता हूँ? मुझे पांच बिंदु बताएं।",
    "मैं अपने समय प्रबंधन कौशल को कैसे सुधार सकता हूँ? मुझे पांच बिंदु बताएं और उनका वर्णन करें।",
]
outputs = inference(input_prompts, model, tokenizer)
print(outputs)
```
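The example above uses greedy decoding. For more varied outputs, a sampled-decoding variant is sketched below; the `temperature` and `top_p` values are illustrative choices, not settings from the technical report:

```python
# Sketch: sampled decoding for a single prompt, reusing the helpers defined above.
prompt = create_prompt_with_chat_format(
    [{"role": "user", "content": input_prompts[0]}], add_bos=False
)
enc = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
    generated = model.generate(
        enc.input_ids,
        attention_mask=enc.attention_mask,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        max_new_tokens=250,
    )
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```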
📚 Documentation
Model Details
This model is a fine-tuned version of the [OpenHathi](https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base) model on the [IndicInstruct dataset](https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1). It was trained as part of the technical report Airavata: Introducing Hindi Instruction-tuned LLM. The codebase used for training and evaluation can be found at https://github.com/AI4Bharat/IndicInstruct.
Hyperparameters
We fine-tune the OpenHathi base model on the aforementioned IndicInstruct dataset with LoRA. The hyperparameters for the LoRA fine-tuning are listed below, and a configuration sketch follows the list:
- LoRA Rank: 16
- LoRA alpha: 32
- LoRA Dropout: 0.05
- LoRA Target Modules: ["q_proj", "v_proj", "k_proj", "down_proj", "gate_proj", "up_proj"]
- Epochs: 4
- Learning rate: 5e-4
- Batch Size: 128
- Floating Point Precision: bfloat16
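For readers who want to mirror this setup with the `peft` library, the listed hyperparameters map onto a `LoraConfig` roughly as follows (a sketch only; the actual training script lives in the IndicInstruct repository):

```python
# Sketch of the LoRA setup described above, expressed as a peft LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj", "gate_proj", "up_proj"],
    task_type="CAUSAL_LM",
)
```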
We recommend that readers check out our official blog post for more details on model training, ablations, and evaluation results.
🔧 Technical Details
Model Architecture
The model is based on the OpenHathi architecture and is fine-tuned using LoRA on the IndicInstruct dataset. This approach allows for efficient fine-tuning and better performance in instruction-following tasks.
Training Data
The model is trained on the [IndicInstruct dataset](https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1), which is a collection of instruction datasets including Anudesh, wikiHow, Flan v2, Dolly, Anthropic-HHH, OpenAssistant v1, and LMSYS-Chat.
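To see which instruction collections the dataset repository bundles, one option is to list its files from the Hub (a sketch using `huggingface_hub`; loading individual subsets with `datasets.load_dataset` may additionally require the subset names given on the dataset card):

```python
# Sketch: inspect the IndicInstruct dataset repository on the Hugging Face Hub.
from huggingface_hub import list_repo_files

files = list_repo_files("ai4bharat/indic-instruct-data-v0.1", repo_type="dataset")
print(files)
```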
📄 License
This model is licensed under the Llama 2 license.
📊 Evaluation Results
Open LLM Leaderboard Evaluation Results
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_ai4bharat__Airavata).
| Metric | Value |
|---|---|
| Avg. | 45.52 |
| AI2 Reasoning Challenge (25-Shot) | 46.50 |
| HellaSwag (10-Shot) | 69.26 |
| MMLU (5-Shot) | 43.90 |
| TruthfulQA (0-shot) | 40.62 |
| Winogrande (5-shot) | 68.82 |
| GSM8k (5-shot) | 4.02 |
📖 Citation
```bibtex
@article{gala2024airavata,
  title   = {Airavata: Introducing Hindi Instruction-tuned LLM},
  author  = {Jay Gala and Thanmay Jayakumar and Jaavid Aktar Husain and Aswanth Kumar M and Mohammed Safi Ur Rahman Khan and Diptesh Kanojia and Ratish Puduppully and Mitesh M. Khapra and Raj Dabre and Rudra Murthy and Anoop Kunchukuttan},
  year    = {2024},
  journal = {arXiv preprint arXiv:2401.15006}
}
```

