LLAMA-VaaniSetu-EN2PA: English to Punjabi Translation with Large Language Models
This model, LLAMA-VaaniSetu-EN2PA, is a fine-tuned version of the LLaMA 3.1 8B architecture, built specifically for English to Punjabi translation. Trained on roughly 10 million English<>Punjabi sentence pairs from AI4Bharat's Bharat Parallel Corpus Collection (BPCC), it aims to fill the gap in open-source English to Punjabi translation models and can be used to translate a wide range of documents for Punjabi-speaking users.
Features
- Targeted Translation: Specialized for English to Punjabi translation.
- Large-scale Training: Utilizes 10 million parallel English-Punjabi sentences from BPCC.
- Potential Applications: Ideal for translating judicial documents, government orders, court judgments, etc.
Installation
Requirements
- Python 3.8.10 or above
- Required Python packages:
  - transformers
  - torch
  - huggingface_hub
  - accelerate (needed for device_map="auto")
Installation Instructions
To use this model, make sure you have the following dependencies installed:
pip install torch transformers huggingface_hub accelerate
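Optionally, you can confirm that PyTorch can see a GPU before downloading the model weights (roughly 16 GB in BF16). This check is purely a convenience and is not required:

import torch
import transformers

# Optional sanity check of the local environment.
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))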
Usage Examples
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def load_model():
    # Load the tokenizer and the model in BF16; device_map="auto" places it on the available GPU(s).
    tokenizer = AutoTokenizer.from_pretrained("partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA")
    model = AutoModelForCausalLM.from_pretrained(
        "partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    return model, tokenizer

model, tokenizer = load_model()

def translate_to_punjabi(english_text):
    # Alpaca-style prompt: instruction, English input, and an empty response slot
    # that the model fills with the Punjabi translation.
    translate_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""

    formatted_input = translate_prompt.format(
        "You are given the english text, read it and understand it. After reading translate the english text to Punjabi and provide the output strictly",
        english_text,
        "",
    )

    # Tokenize and move the inputs to the same device as the model.
    inputs = tokenizer([formatted_input], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=500)
    translated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Keep only the text generated after the "### Response:" marker.
    return translated_text.split("Response:")[-1].strip()
english_text = """
Delhi is a beautiful place
"""
punjabi_translation = translate_to_punjabi(english_text)
print(punjabi_translation)
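The same function can be reused for several sentences. The list below is purely illustrative:

# Illustrative only: reuse translate_to_punjabi() from the example above
# to translate a small batch of sentences one at a time.
sentences = [
    "The court hearing is scheduled for Monday.",
    "Please submit the application form before the deadline.",
]
for sentence in sentences:
    print(translate_to_punjabi(sentence))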
Documentation
Model and Data Information
| Property | Details |
|----------|---------|
| Model Type | Based on LLaMA 3.1 8B with BF16 precision |
| Training Data | 10 million English<>Punjabi parallel sentences from AI4Bharat's Bharat Parallel Corpus Collection (BPCC) |
| Evaluation Data | Evaluated on 1,503 samples from the IN22-Conv dataset via IndicTrans2 |
| Score (chrF++) | chrF++ score of 28.1 on the IN22-Conv dataset |
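For reference, chrF++ scores of this kind can be computed with sacrebleu (word_order=2 corresponds to chrF++). The sketch below is illustrative only: the file paths are placeholders for the IN22-Conv source and reference files, and it reuses translate_to_punjabi() from the usage example above.

import sacrebleu

# Placeholder paths; the IN22-Conv evaluation set is distributed by AI4Bharat.
with open("in22_conv.en", encoding="utf-8") as f:
    sources = [line.strip() for line in f]
with open("in22_conv.pa", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# Translate every English source sentence and score against the Punjabi references.
hypotheses = [translate_to_punjabi(src) for src in sources]
score = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)
print(score)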
GPU Requirements for Inference
To perform inference with this model, here are the minimum GPU requirements:
- Memory Requirements: 16-18 GB of VRAM for inference in BF16 (bfloat16) precision.
- Recommended GPUs:
- NVIDIA A100: well suited to BF16 precision and efficiently handles large models like LLaMA 8B.
- Other GPUs with at least 16 GB of VRAM may also work, though performance will vary with available memory; for smaller GPUs, see the quantization sketch below.
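If 16 GB of VRAM is not available, 4-bit quantization with bitsandbytes is one way to shrink the memory footprint. The following is a sketch under the assumption that bitsandbytes is installed; it is not an officially tested configuration, and translation quality may degrade slightly.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumption: 4-bit NF4 quantization with BF16 compute; not validated by the model authors.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA")
model = AutoModelForCausalLM.from_pretrained(
    "partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA",
    quantization_config=bnb_config,
    device_map="auto",
)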
Notes
Important Note
The translation function handles English to Punjabi translation and can be used for a range of applications, such as translating judicial documents, government orders, and other official documents into Punjabi.
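For longer documents, one simple approach is to split the text into paragraphs and translate each chunk separately. The helper below is a hypothetical sketch that reuses translate_to_punjabi() from the usage example; splitting on blank lines is an assumption, so adjust it to your document format.

# Hypothetical helper: translate a multi-paragraph English document.
def translate_document(english_document):
    # Split on blank lines (assumption about the document layout).
    paragraphs = [p.strip() for p in english_document.split("\n\n") if p.strip()]
    # Translate each paragraph independently and stitch the results back together.
    return "\n\n".join(translate_to_punjabi(p) for p in paragraphs)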
Performance and Future Work
As this is the first release of the LLAMA-VaaniSetu-EN2PA model, there is room for improvement, particularly in increasing the chrF++ score. Future versions of the model will focus on optimizing performance, enhancing the translation quality, and expanding to additional domains.
Stay tuned for updates, and feel free to contribute or raise issues on Hugging Face or the associated repositories!
Resources
Contributors
- Rohit Anurag - Principal Software Engineer, PerpetualBlock - A Partex Company
Acknowledgements
- AI4Bharat: Source of the training data (BPCC) and the evaluation data (IN22-Conv).
License
This model is licensed under the terms applicable to the LLaMA 3.1 architecture and to the datasets used during fine-tuning.