Llama-3.1-8B-VaaniSetu-EN2PA Open-Source Translation Model - Free English-to-Punjabi Translation

Llama 3.1 8B VaaniSetu EN2PA

Developed by partex-nv

An English-to-Punjabi translation model fine-tuned from the LLaMA 3.1 8B architecture and trained on the Bharat parallel corpus.

Machine Translation

Safetensors

Multilingual support #English-to-Punjabi translation #Judicial document translation #LLaMA 3.1 fine-tuning

Downloads: 48

Released: 9/25/2024

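Model Overview

The model is designed for English-to-Punjabi translation. It is intended for documents such as judicial files and government orders, and serves Punjabi speakers.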

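Model Highlights

High-quality translation

Trained on 10 million English<>Punjabi parallel sentence pairs, with the aim of producing high-quality translations.

Open-source model

Fills a gap in open-source English-to-Punjabi translation models.

Suited to specialized domains

Particularly suited to translating specialized documents such as judicial files and government orders.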

Model Capabilities

English-to-Punjabi translation

Text generation

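Use Cases

Document translation

Judicial document translation

Translate English judicial documents into Punjabi.

Government order translation

Translate English government orders into Punjabi.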

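🚀 🦙📝 LLAMA-VaaniSetu-EN2PA: English-to-Punjabi Translation with Large Language Models

LLAMA-VaaniSetu-EN2PA is a model built specifically for English-to-Punjabi translation. It is a fine-tuned version of the LLaMA 3.1 8B model and aims to fill a gap in open-source English-to-Punjabi translation models. It can be applied to judicial documents, government orders, court judgments, and similar documents for the benefit of Punjabi speakers.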

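🚀 Quick Start

LLAMA-VaaniSetu-EN2PA is a fine-tuned version of the LLaMA 3.1 8B model, built specifically for English-to-Punjabi translation. It was trained on the Bharat Parallel Corpus Collection (BPCC), provided by AI4Bharat, which contains about 10 million English<>Punjabi sentence pairs.

The model aims to fill a gap in open-source English-to-Punjabi translation models. It can be used to translate judicial documents, government orders, court judgments, and similar documents to meet the needs of Punjabi speakers.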

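✨ Key Features

Task-specific: fine-tuned specifically for English-to-Punjabi translation, so it is better matched to this task than a general-purpose model.
Data-rich: trained on the BPCC corpus of about 10 million English<>Punjabi sentence pairs.
Broadly applicable: can be used for document translation in judicial, governmental, and related domains.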

📦 Installation

Requirements

Python 3.8.10 or later
Required Python packages:
- transformers
- torch
- huggingface_hub

Install command

pip install torch transformers huggingface_hub
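
Before loading the model, it can help to confirm that the interpreter meets the stated minimum version. The following is a minimal sketch; the helper name `meets_minimum` is hypothetical, and only the 3.8.10 floor comes from the requirements above.

```python
import sys

# Minimum version taken from the requirements listed above.
MIN_PYTHON = (3, 8, 10)


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor, micro) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) >= minimum


# Check the running interpreter.
interpreter_ok = meets_minimum()
```

If `interpreter_ok` is False, upgrading Python before installing the packages avoids harder-to-read import errors later.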

💻 Usage Example

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


# Load model and tokenizer
def load_model():
    tokenizer = AutoTokenizer.from_pretrained("partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA")
    model = AutoModelForCausalLM.from_pretrained(
        "partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA",
        torch_dtype=torch.bfloat16,
        device_map="auto",  # Automatically moves model to GPU
    )
    return model, tokenizer

model, tokenizer = load_model()

# Translate English text to Punjabi
def translate_to_punjabi(english_text):
    # Create the prompt
    translate_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
    
    ### Instruction:
    {}
    
    ### Input:
    {}
    
    ### Response:
    {}"""
    
    # Format the prompt
    formatted_input = translate_prompt.format(
        "You are given the english text, read it and understand it. After reading translate the english text to Punjabi and provide the output strictly",  # Instruction
        english_text,  # Input text to be translated
        ""  # Output - leave blank for generation
    )
    
    # Tokenize the input and move it to the same device as the model
    inputs = tokenizer([formatted_input], return_tensors="pt").to(model.device)

    # Generate the translation output
    output_ids = model.generate(**inputs, max_new_tokens=500)

    # Decode the output
    translated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Keep only the text after the final "Response:" marker
    return translated_text.split("Response:")[-1].strip()


english_text = """
Delhi is a beautiful place
"""

punjabi_translation = translate_to_punjabi(english_text)

print(punjabi_translation)
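
The example above decodes the whole sequence and then splits on the "Response:" marker. For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so an alternative is to drop the prompt tokens before decoding. The sketch below shows that slicing on plain Python lists so it runs without a GPU; the helper name `strip_prompt_tokens` and the token ids are hypothetical.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the tokens that were generated after the prompt."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]


# Toy token ids standing in for real tokenizer output (hypothetical values).
prompt_ids = [101, 7, 8, 9]
full_output = prompt_ids + [42, 43, 2]

generated = strip_prompt_tokens(full_output, len(prompt_ids))
print(generated)
```

With the real model, `prompt_length` would be `inputs["input_ids"].shape[1]`, and the sliced ids would be passed to `tokenizer.decode(..., skip_special_tokens=True)`.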

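📚 Detailed Documentation

Model and Data Information

Attribute	Details
Model type	Based on the LLaMA 3.1 8B architecture, using BF16 precision
Training data	10 million English<>Punjabi parallel sentences from AI4Bharat's Bharat Parallel Corpus Collection (BPCC)
Evaluation data	Evaluated on 1,503 samples from the IN22-Conv dataset, which is also available through IndicTrans2
Evaluation metric (chrF++)	Scored 28.1 chrF++ on the IN22-Conv dataset, which the authors consider a decent result for an open-source model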
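
To give a feel for what a character n-gram F-score measures, here is a deliberately simplified sketch. It is not the official chrF++ (it omits word n-grams and other details) and will not reproduce the 28.1 figure; the source does not state which tool produced that score. The function names are hypothetical. For real evaluation, an established implementation such as sacreBLEU is the safer choice.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score on a 0-100 scale.

    Averages precision and recall over n-gram orders 1..max_n, then
    combines them with F-beta. Word n-grams (the "++" part) are omitted.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

With `beta=2.0`, recall is weighted more heavily than precision, so a translation that drops content is penalized more than one that adds a little extra.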

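GPU Inference Requirements

Inference with this model has the following minimum GPU requirements:

Memory: inference in BF16 (BFloat16) precision requires 16-18 GB of VRAM.
Recommended GPUs:
- NVIDIA A100 (20GB): well suited to BF16 precision and able to handle large models such as LLaMA 8B efficiently.
- Other GPUs with at least 16 GB of VRAM can also be used, though performance may vary with available memory.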
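
The 16-18 GB figure is consistent with a back-of-the-envelope estimate: BF16 stores each parameter in 2 bytes, so the weights of a model with roughly 8 billion parameters take about 16 GB before activations and the key-value cache are counted. The sketch below does that arithmetic; the parameter count is a round-number assumption and the helper name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Estimate memory for the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumption: roughly 8 billion parameters; BF16 uses 2 bytes per parameter.
approx_params = 8_000_000_000
weights_gb = weight_memory_gb(approx_params)
print(f"Weights alone: about {weights_gb:.1f} GB")
```

Activations and the key-value cache come on top of this, which is why a card with exactly 16 GB may be tight for longer inputs.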

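Notes

⚠️ Important

The translation function handles English-to-Punjabi translation only. It can be used in a range of scenarios, such as translating judicial documents, government orders, and similar documents into Punjabi.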
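
Because the usage example caps generation at `max_new_tokens=500`, a long document passed in one call may come back truncated. One possible approach is to translate it paragraph by paragraph. The sketch below assumes paragraphs are separated by blank lines and are each short enough to fit; `translate_document` is a hypothetical helper, and `translate_fn` stands for any English-to-Punjabi callable such as the `translate_to_punjabi` function from the usage example.

```python
def translate_document(document, translate_fn):
    """Translate a document paragraph by paragraph.

    `translate_fn` is any callable that maps an English string to a
    Punjabi string. Empty paragraphs are skipped.
    """
    paragraphs = [p.strip() for p in document.split("\n\n")]
    translated = [translate_fn(p) for p in paragraphs if p]
    return "\n\n".join(translated)


# Demonstration with a stand-in function instead of the real model.
sample = "First paragraph.\n\nSecond paragraph."
result = translate_document(sample, str.upper)
print(result)
```

Translating paragraphs independently loses cross-paragraph context, so terminology may vary between chunks; for legal text it is worth reviewing the joined output.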

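💡 Usage Tips

Because this is the first release of LLAMA-VaaniSetu-EN2PA, there is still room for improvement, particularly in the chrF++ score. Future versions will focus on optimizing performance, improving translation quality, and extending to more domains. Watch for updates, and feel free to contribute code or open issues on Hugging Face or the related repositories!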