🚀 AryaBhatta Model Series
This project presents the AryaBhatta model series, which consists of two models: AryaBhatta-1 and AryaBhatta-2. These models are fine-tuned from either HuggingFaceH4/zephyr-7b-gemma-v0.1 or Google's Gemma and are optimized for 9 Indian languages (Hindi, Tamil, Punjabi, Bengali, Gujarati, Oriya, Telugu, Kannada, Malayalam) along with English.
✨ Features
- Multi-language Support: Fine-tuned on 9 Indian languages plus English, enabling broader language coverage.
- Enhanced Reasoning and Math Skills: Fine-tuning on Microsoft's Orca datasets significantly improves mathematical reasoning.
- Benchmark Performance: Achieves competitive scores on AGIEval, GPT4All, TruthfulQA, and BigBench compared to other 7B models (see Benchmark Scores below).
📦 Installation
No specific installation steps are provided in the original document. The packages sketched below (inferred from the usage example) are typically sufficient; after installing them, follow the usage example.
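A minimal environment sketch, assuming a recent Python and a CUDA-capable GPU; the exact package list and versions are not specified by the original card:

```bash
# Suggested packages for the usage example below (unpinned; adjust versions for your setup)
pip install torch transformers peft accelerate bitsandbytes
```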
💻 Usage Examples
Basic Usage
```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

hf_token = "YOUR_HF_ACCESS_TOKEN"  # Hugging Face access token with read permission

# Load the adapter together with its base model
model = AutoPeftModelForCausalLM.from_pretrained(
    "GenVRadmin/AryaBhatta-GemmaOrca",
    load_in_4bit=False,
    token=hf_token,
)
tokenizer = AutoTokenizer.from_pretrained("GenVRadmin/AryaBhatta-GemmaOrca")

# Alpaca-style prompt template: instruction, optional input, empty response slot
input_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}"""

input_text = input_prompt.format(
    "Answer this question about India.",
    "Who is the Prime Minister of India",
    "",  # leave the response slot empty; the model fills it in
)

inputs = tokenizer([input_text], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300, use_cache=True)
response = tokenizer.batch_decode(outputs)[0]
print(response)
```
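Since the model targets nine Indian languages, the same prompt template can be reused with a non-English instruction. The Hindi prompt below is an illustrative example, not taken from the original card:

```python
# Illustrative Hindi prompt (not from the original card); reuses the template above
hindi_text = input_prompt.format(
    "इस प्रश्न का उत्तर दीजिए।",   # "Answer this question."
    "भारत की राजधानी क्या है?",     # "What is the capital of India?"
    "",
)
inputs = tokenizer([hindi_text], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300, use_cache=True)
print(tokenizer.batch_decode(outputs)[0])
```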
🔧 Technical Details
- Fine-tuning Bases: The models are fine-tuned from HuggingFaceH4/zephyr-7b-gemma-v0.1 or Google's Gemma.
- Initial Tuning: To enhance reasoning and math skills, the models are first SFT-tuned on Microsoft's Orca datasets: the Orca Maths Hindi dataset (GenVRadmin/Aryabhatta-Orca-Maths-Hindi) and the original Orca maths dataset (microsoft/orca-math-word-problems-200k). This boosts the MATHS score from 24.3 in Gemma-7B to 25.5 in Zephyr-Gemma and 31.6 in GemmaOrca. A loading sketch for these datasets follows this list.
- Subsequent Tuning: The models are then fine-tuned on GenVR's Samvaad datasets (GenVRadmin/Samvaad-Indic-Positive, GenVRadmin/Samvaad-Tamil-Mixtral, and a subset of GenVRadmin/Samvaad-Mixed-Language-3), followed by various open-sourced instruction datasets such as Telugu-LLM-Labs/yahma_alpaca_cleaned_telugu_filtered_and_romanized, abhinand/tamil-alpaca, etc.
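For quick inspection of the Orca-style SFT data named above, the Hugging Face datasets library can be used. This is a minimal sketch; the "train" split name is an assumption, so check each dataset card:

```python
from datasets import load_dataset

# Datasets referenced in the card; the "train" split name is an assumption
orca_hindi = load_dataset("GenVRadmin/Aryabhatta-Orca-Maths-Hindi", split="train")
orca_math = load_dataset("microsoft/orca-math-word-problems-200k", split="train")

print(orca_hindi)    # column names and row count
print(orca_math[0])  # one math word problem / answer pair
```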
📚 Documentation
Model Variants
There are two models in the AryaBhatta series. One is fine-tuned on Google's Gemma, and the other is fine-tuned on Zephyr's Gemma base. The repo for the Zephyr-based model is GenVRadmin/AryaBhatta-GemmaOrca-2-Merged; a loading sketch follows.
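The merged variant can be loaded directly with transformers rather than peft. This is a sketch that assumes a standard merged causal-LM checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: load the merged variant as a plain causal LM (hf_token as in the usage example)
model_2 = AutoModelForCausalLM.from_pretrained(
    "GenVRadmin/AryaBhatta-GemmaOrca-2-Merged",
    token=hf_token,
)
tokenizer_2 = AutoTokenizer.from_pretrained("GenVRadmin/AryaBhatta-GemmaOrca-2-Merged")
```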
Benchmark Scores
| Model | AGIEval | GPT4All | TruthfulQA | BigBench | Average ⬇️ |
|---|---|---|---|---|---|
| AryaBhatta-GemmaOrca | 35.9 | 72.26 | 53.85 | 40.35 | 50.59 |
| zephyr-7b-beta | 37.52 | 71.77 | 55.26 | 39.77 | 51.08 |
| zephyr-7b-gemma-v0.1 | 34.22 | 66.37 | 52.19 | 37.10 | 47.47 |
| mlabonne/Gemmalpaca-7B | 21.6 | 40.87 | 44.85 | 30.49 | 34.45 |
| google/gemma-7b-it | 21.33 | 40.84 | 41.70 | 30.25 | 33.53 |
📄 License
This project is licensed under the MIT license.