Taigi-Llama-2-Translator-7B Open-source Translation Model - Freely Translate between Taigi, Traditional Chinese, and English

Taigi Llama 2 Translator 7B

Developed by Bohanlu

Built on the Taigi-Llama-2 series of models, with a focus on translation between Taiwanese Southern Min, Traditional Chinese, and English.

Machine Translation

Transformers

#Translation of multiple written forms of Southern Min #Mutual translation between Chinese, English, and Taiwanese #Conversion between Chinese characters and Pe̍h-ōe-jī

Downloads 1,915

Release date: 5/13/2024

Model Overview

This model is fine-tuned on 263k parallel examples and translates in both directions between Taiwanese Southern Min (Chinese characters, Pe̍h-ōe-jī, or Hanlo, which mixes Chinese characters with Pe̍h-ōe-jī), Traditional Chinese, and English.

Model Features

Multilingual translation

Supports translation between Traditional Chinese or English and Taiwanese Southern Min (Chinese characters, Pe̍h-ōe-jī, or Hanlo), and also converts between the different writing systems of Taiwanese Southern Min.

Support for multiple writing systems

Supports three written forms of Taiwanese Southern Min: Chinese characters (HAN), Pe̍h-ōe-jī (POJ), and Hanlo (HL), which mixes Chinese characters with Pe̍h-ōe-jī.

Large-scale training data

Fine-tuned on 263k parallel examples to improve translation quality.

Model Capabilities

Text translation

Multilingual conversion

Writing system conversion

Use Cases

Language translation

Translation from English to Taiwanese Southern Min

Translate English text into different writing forms of Taiwanese Southern Min

How are you today? → 你今仔日好無？(Chinese characters)

Translation from Traditional Chinese to Taiwanese Southern Min

Translate Traditional Chinese text into different writing forms of Taiwanese Southern Min

Writing system conversion

Conversion from Chinese characters to Pe̍h-ōe-jī

Convert the Chinese character form of Taiwanese Southern Min into the Pe̍h-ōe-jī form

🚀 Model Card for Taigi-Llama-2-Translator-7B

The Taigi-Llama-2-Translator series consists of translation models for Taiwanese Hokkien and related languages. The models are built on the Taigi-Llama-2 series and fine-tuned on 263k parallel examples.

🚀 Quick Start

For more details about this model, you can refer to our GitHub repository and the paper: Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems. You can also explore other models and datasets in the Taiwanese Hokkien LLM collection.

✨ Features

Base Model: Bohanlu/Taigi-Llama-2-7B
Usage: This model can perform translations between Traditional Chinese or English and Taiwanese Hokkien (Hanzi, POJ, or Hanlo). It also supports translations among different scripts of Taiwanese Hokkien (Hanzi, POJ, Hanlo).
Language(s) (NLP): Taiwanese Hokkien (Hanzi, POJ and Hanlo), Traditional Chinese and English
Input: Text in the source language
Output: Text in the target language
Model Size: 7B parameters
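
The translation directions described in the list above can be enumerated in a few lines. This is a minimal sketch, not part of the model's own code: it uses the target-language codes given in the prompt template section (ZH, EN, HAN, POJ, HL), and the helper name documented_directions is hypothetical. It counts only the directions the usage description names explicitly, that is, pairs where at least one side is a Taiwanese Hokkien script.

```python
from itertools import permutations

# Codes taken from the prompt template section of this card.
TAIGI_SCRIPTS = ("HAN", "POJ", "HL")  # Hanzi, POJ, Hanlo
OTHER_LANGUAGES = ("ZH", "EN")        # Traditional Chinese, English


def documented_directions() -> list[tuple[str, str]]:
    """Hypothetical helper: list (source, target) pairs that involve
    at least one Taiwanese Hokkien script on either side."""
    codes = OTHER_LANGUAGES + TAIGI_SCRIPTS
    return [
        (source, target)
        for source, target in permutations(codes, 2)
        if source in TAIGI_SCRIPTS or target in TAIGI_SCRIPTS
    ]


directions = documented_directions()
print(len(directions))  # 18: 6 into Taigi, 6 out of Taigi, 6 between scripts
for source, target in directions:
    print(f"{source} -> {target}")
```

Note that the prompt template takes only a target code, so the source side of each pair is implied by the input text rather than passed to the model.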

📦 Installation

The Python example under Usage Examples imports transformers, torch, and accelerate, so these packages need to be installed first.

💻 Usage Examples

Basic Usage

The prompt template for this model is as follows:

{BOS}[TRANS]\n{source_sentence}\n[/TRANS]\n[{target_language}]\n

source_sentence: The sentence you want to translate.
target_language: The target language you want to translate to. Use "ZH" for Traditional Chinese, "EN" for English, "POJ" for Taiwanese Hokkien POJ, "HL" for Taiwanese Hokkien Hanlo, and "HAN" for Taiwanese Hokkien Hanzi.
Make sure the prompt ends with a newline, as in the template.
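
The template rules above can be wrapped in a small helper that rejects unknown target codes before anything is sent to the model. This is a sketch under stated assumptions: the function name build_prompt, the case-insensitive handling of codes, and the stripping of surrounding whitespace are illustrative choices, not part of the original card. It also assumes the tokenizer prepends the BOS token, which is why {BOS} is left out of the string, as in the card's own Python example.

```python
# Target-language codes listed in the model card.
TARGET_LANGUAGES = {
    "ZH": "Traditional Chinese",
    "EN": "English",
    "POJ": "Taiwanese Hokkien POJ",
    "HL": "Taiwanese Hokkien Hanlo",
    "HAN": "Taiwanese Hokkien Hanzi",
}

# Same template as the card, without {BOS} (assumed to be added by the tokenizer).
PROMPT_TEMPLATE = "[TRANS]\n{source_sentence}\n[/TRANS]\n[{target_language}]\n"


def build_prompt(source_sentence: str, target_language: str) -> str:
    """Hypothetical helper: validate the inputs and fill the prompt template."""
    code = target_language.strip().upper()  # assumption: accept lower-case codes
    if code not in TARGET_LANGUAGES:
        raise ValueError(
            f"unknown target language {target_language!r}; "
            f"expected one of {sorted(TARGET_LANGUAGES)}"
        )
    sentence = source_sentence.strip()  # assumption: trim stray whitespace
    if not sentence:
        raise ValueError("source_sentence is empty")
    return PROMPT_TEMPLATE.format(source_sentence=sentence, target_language=code)


print(repr(build_prompt("How are you today?", "poj")))
# '[TRANS]\nHow are you today?\n[/TRANS]\n[POJ]\n'
```

Validating the code up front keeps a typo such as "HANZI" from reaching the model as an unsupported tag.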

Advanced Usage

Here is a Python code example demonstrating how to use the model for translation:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline
import torch
import accelerate

def get_pipeline(path:str, tokenizer:AutoTokenizer, accelerator:accelerate.Accelerator) -> TextGenerationPipeline:
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True)
    
    terminators = [tokenizer.eos_token_id, tokenizer.pad_token_id]

    pipeline = TextGenerationPipeline(model = model, tokenizer = tokenizer, num_workers=accelerator.state.num_processes*4, pad_token_id=tokenizer.pad_token_id, eos_token_id=terminators)

    return pipeline

model_dir = "Bohanlu/Taigi-Llama-2-Translator-7B" # or "Bohanlu/Taigi-Llama-2-Translator-13B" for the 13B model
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)

accelerator = accelerate.Accelerator()
pipe = get_pipeline(model_dir, tokenizer, accelerator)

PROMPT_TEMPLATE = "[TRANS]\n{source_sentence}\n[/TRANS]\n[{target_language}]\n"

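def translate(source_sentence:str, target_language:str) -> str:
    prompt = PROMPT_TEMPLATE.format(source_sentence=source_sentence, target_language=target_language)
    # Greedy decoding; return only the newly generated text, not the prompt.
    out = pipe(prompt, return_full_text=False, repetition_penalty=1.1, do_sample=False)[0]['generated_text']
    # Keep the text before the first "[/" marker; partition leaves the output intact if no marker appears.
    return out.partition("[/")[0].strip()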

source_sentence = "How are you today？"

print("To Hanzi: " + translate(source_sentence, "HAN"))
# Output: To Hanzi: 你今仔日好無？

print("To POJ: " + translate(source_sentence, "POJ"))
# Output: To POJ: Lí kin-á-ji̍t án-chóaⁿ?

print("To Traditional Chinese: " + translate(source_sentence, "ZH"))
# Output: To Traditional Chinese: 你今天好嗎？

print("To Hanlo: " + translate(source_sentence, "HL"))
# Output: To Hanlo: 你今仔日好無？
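
The post-processing step in translate, which keeps only the text before the first "[/" marker, can be tested without loading the model. Below is a minimal sketch of that step as a standalone function; the name extract_translation and the raw strings are made up for illustration, and the exact closing tag the model emits is not documented here. A slice built on str.find would drop the last character when the marker is missing, because find returns -1 in that case; str.partition avoids the problem.

```python
def extract_translation(generated_text: str, marker: str = "[/") -> str:
    """Hypothetical helper: return the text before the first marker,
    or the whole text if the marker never appears."""
    head, _, _ = generated_text.partition(marker)
    return head.strip()


# Illustrative raw outputs (not real model generations).
print(extract_translation("你今仔日好無？\n[/HAN]"))  # 你今仔日好無？
print(extract_translation("你今仔日好無？"))           # 你今仔日好無？
```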

📄 License

The model is licensed under cc-by-nc-sa-4.0.

📖 Citation

If you find the resources in the Taiwanese Hokkien LLM collection useful in your work, please cite it using the following reference:

@misc{lu2024enhancing,
      title={Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems}, 
      author={Bo-Han Lu and Yi-Hsuan Lin and En-Shiun Annie Lee and Richard Tzong-Han Tsai},
      year={2024},
      eprint={2403.12024},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
