LLAMA-VaaniSetu-EN2PA: English to Punjabi Translation with Large Language Models
This model, LLAMA-VaaniSetu-EN2PA, is a fine-tuned version of the LLaMA 3.1 8B architecture, built specifically for English to Punjabi translation. Trained on roughly 10 million English<>Punjabi sentence pairs from AI4Bharat's Bharat Parallel Corpus Collection (BPCC), it aims to fill the gap in open-source English to Punjabi translation models and can be used to translate a wide range of documents for Punjabi-speaking users.
Features
- Targeted Translation: Specialized for English to Punjabi translation.
- Large-scale Training: Utilizes 10 million parallel English-Punjabi sentences from BPCC.
- Potential Applications: Ideal for translating judicial documents, government orders, court judgments, etc.
Installation
Requirements
- Python 3.8.10 or above
- Required Python packages:
  - transformers
  - torch
  - huggingface_hub
  - accelerate (needed for device_map="auto")
Installation Instructions
To use this model, make sure you have the following dependencies installed:
pip install torch transformers huggingface_hub accelerate
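Optionally, you can confirm that PyTorch can see a GPU before downloading the model weights (roughly 16 GB in BF16). This check is purely a convenience and is not required:

import torch
import transformers

# Optional sanity check of the local environment.
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))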
Usage Examples
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def load_model():
    # Load the tokenizer and the model in BF16; device_map="auto" places it on the available GPU(s).
    tokenizer = AutoTokenizer.from_pretrained("partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA")
    model = AutoModelForCausalLM.from_pretrained(
        "partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    return model, tokenizer

model, tokenizer = load_model()

def translate_to_punjabi(english_text):
    # Alpaca-style prompt: instruction, English input, and an empty response slot
    # that the model fills with the Punjabi translation.
    translate_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""

    formatted_input = translate_prompt.format(
        "You are given the english text, read it and understand it. After reading translate the english text to Punjabi and provide the output strictly",
        english_text,
        "",
    )

    # Tokenize and move the inputs to the same device as the model.
    inputs = tokenizer([formatted_input], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=500)
    translated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Keep only the text generated after the "### Response:" marker.
    return translated_text.split("Response:")[-1].strip()
english_text = """
Delhi is a beautiful place
"""
punjabi_translation = translate_to_punjabi(english_text)
print(punjabi_translation)
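The same function can be reused for several sentences. The list below is purely illustrative:

# Illustrative only: reuse translate_to_punjabi() from the example above
# to translate a small batch of sentences one at a time.
sentences = [
    "The court hearing is scheduled for Monday.",
    "Please submit the application form before the deadline.",
]
for sentence in sentences:
    print(translate_to_punjabi(sentence))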
Documentation
Model and Data Information
| Property | Details |
|----------|---------|
| Model Type | Based on LLaMA 3.1 8B with BF16 precision |
| Training Data | 10 million English<>Punjabi parallel sentences from AI4Bharat's Bharat Parallel Corpus Collection (BPCC) |
| Evaluation Data | Evaluated on 1,503 samples from the IN22-Conv dataset via IndicTrans2 |
| Score (chrF++) | chrF++ score of 28.1 on the IN22-Conv dataset |
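For reference, chrF++ scores of this kind can be computed with sacrebleu (word_order=2 corresponds to chrF++). The sketch below is illustrative only: the file paths are placeholders for the IN22-Conv source and reference files, and it reuses translate_to_punjabi() from the usage example above.

import sacrebleu

# Placeholder paths; the IN22-Conv evaluation set is distributed by AI4Bharat.
with open("in22_conv.en", encoding="utf-8") as f:
    sources = [line.strip() for line in f]
with open("in22_conv.pa", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# Translate every English source sentence and score against the Punjabi references.
hypotheses = [translate_to_punjabi(src) for src in sources]
score = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)
print(score)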
GPU Requirements for Inference
To perform inference with this model, here are the minimum GPU requirements:
- Memory Requirements: 16-18 GB of VRAM for inference in BF16 (bfloat16) precision.
- Recommended GPUs:
- NVIDIA A100: well suited to BF16 precision and efficiently handles large models like LLaMA 8B.
- Other GPUs with at least 16 GB of VRAM may also work, though performance will vary with available memory; for smaller GPUs, see the quantization sketch below.
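If 16 GB of VRAM is not available, 4-bit quantization with bitsandbytes is one way to shrink the memory footprint. The following is a sketch under the assumption that bitsandbytes is installed; it is not an officially tested configuration, and translation quality may degrade slightly.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumption: 4-bit NF4 quantization with BF16 compute; not validated by the model authors.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA")
model = AutoModelForCausalLM.from_pretrained(
    "partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA",
    quantization_config=bnb_config,
    device_map="auto",
)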
Notes
Important Note
The translation function handles English to Punjabi translation and can be used for a range of applications, such as translating judicial documents, government orders, and other official documents into Punjabi.
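For longer documents, one simple approach is to split the text into paragraphs and translate each chunk separately. The helper below is a hypothetical sketch that reuses translate_to_punjabi() from the usage example; splitting on blank lines is an assumption, so adjust it to your document format.

# Hypothetical helper: translate a multi-paragraph English document.
def translate_document(english_document):
    # Split on blank lines (assumption about the document layout).
    paragraphs = [p.strip() for p in english_document.split("\n\n") if p.strip()]
    # Translate each paragraph independently and stitch the results back together.
    return "\n\n".join(translate_to_punjabi(p) for p in paragraphs)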
Performance and Future Work
As this is the first release of the LLAMA-VaaniSetu-EN2PA model, there is room for improvement, particularly in increasing the chrF++ score. Future versions of the model will focus on optimizing performance, enhancing the translation quality, and expanding to additional domains.
Stay tuned for updates, and feel free to contribute or raise issues on Hugging Face or the associated repositories!
Resources
Contributors
- Rohit Anurag - Principal Software Engineer, PerpetualBlock - A Partex Company
Acknowledgements
- AI4Bharat: Source of the training data (BPCC) and the evaluation data (IN22-Conv).
License
This model is licensed under the terms applicable to the LLaMA 3.1 architecture and to the datasets used during fine-tuning.