🚀 Model Card for Taigi-Llama-2-Translator-7B
Taigi-Llama-2-Translator is a series of translation models for Taiwanese Hokkien and related languages. It is built on the Taigi-Llama-2 series of models and fine-tuned on 263k parallel data.
🚀 Quick Start
For more details about this model, you can refer to our GitHub repository and the paper: Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems. You can also explore other models and datasets in the Taiwanese Hokkien LLM collection.
✨ Features
- Base Model: Bohanlu/Taigi-Llama-2-7B
- Usage: This model can perform translations between Traditional Chinese or English and Taiwanese Hokkien (Hanzi, POJ, or Hanlo). It also supports translations among different scripts of Taiwanese Hokkien (Hanzi, POJ, Hanlo).
- Language(s) (NLP): Taiwanese Hokkien (Hanzi, POJ and Hanlo), Traditional Chinese and English
- Input: Text in the source language
- Output: Text in the target language
- Model Size: 7B parameters
💻 Usage Examples
Basic Usage
The prompt template for this model is as follows:
{BOS}[TRANS]\n{source_sentence}\n[/TRANS]\n[{target_language}]\n
- source_sentence: The sentence you want to translate.
- target_language: The target language you want to translate to. Use "ZH" for Traditional Chinese, "EN" for English, "POJ" for Taiwanese Hokkien POJ, "HL" for Taiwanese Hokkien Hanlo, and "HAN" for Taiwanese Hokkien Hanzi.
- Ensure there's a newline at the end.
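The template can also be filled in programmatically. The following is a minimal sketch: the helper name `build_prompt` and the `TARGET_LANGUAGES` table are illustrative and not part of the model's repository. It leaves out `{BOS}` on the assumption that the tokenizer adds that token during encoding.

```python
# Language codes taken from the template description; the dict itself is illustrative.
TARGET_LANGUAGES = {
    "ZH": "Traditional Chinese",
    "EN": "English",
    "POJ": "Taiwanese Hokkien POJ",
    "HL": "Taiwanese Hokkien Hanlo",
    "HAN": "Taiwanese Hokkien Hanzi",
}


def build_prompt(source_sentence: str, target_language: str) -> str:
    """Fill the prompt template, rejecting unknown target language codes."""
    if target_language not in TARGET_LANGUAGES:
        raise ValueError(
            f"Unknown target language {target_language!r}; "
            f"expected one of {sorted(TARGET_LANGUAGES)}"
        )
    # The trailing newline after the target tag is required by the template.
    return f"[TRANS]\n{source_sentence}\n[/TRANS]\n[{target_language}]\n"


print(repr(build_prompt("How are you today?", "POJ")))
# prints '[TRANS]\nHow are you today?\n[/TRANS]\n[POJ]\n'
```

Validating the code up front turns a typo such as "HZ" into an immediate error instead of an unexpected generation.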
Advanced Usage
Here is a Python code example demonstrating how to use the model for translation:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline
import torch
import accelerate

def get_pipeline(path: str, tokenizer: AutoTokenizer, accelerator: accelerate.Accelerator) -> TextGenerationPipeline:
    # Load the model in half precision and let device_map='auto' place it on the available devices.
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True)
    # Stop generation at either the EOS or the PAD token.
    terminators = [tokenizer.eos_token_id, tokenizer.pad_token_id]
    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, num_workers=accelerator.state.num_processes*4, pad_token_id=tokenizer.pad_token_id, eos_token_id=terminators)
    return pipeline

model_dir = "Bohanlu/Taigi-Llama-2-Translator-7B"
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
accelerator = accelerate.Accelerator()
pipe = get_pipeline(model_dir, tokenizer, accelerator)

# {BOS} from the template is not written out here; the tokenizer is expected to add it.
PROMPT_TEMPLATE = "[TRANS]\n{source_sentence}\n[/TRANS]\n[{target_language}]\n"

def translate(source_sentence: str, target_language: str) -> str:
    prompt = PROMPT_TEMPLATE.format(source_sentence=source_sentence, target_language=target_language)
    # Greedy decoding with a mild repetition penalty; return only the newly generated text.
    out = pipe(prompt, return_full_text=False, repetition_penalty=1.1, do_sample=False)[0]['generated_text']
    # Keep the text before the first "[/" marker; if there is none, keep everything.
    end = out.find("[/")
    return (out if end == -1 else out[:end]).strip()

source_sentence = "How are you today?"
print("To Hanzi: " + translate(source_sentence, "HAN"))
print("To POJ: " + translate(source_sentence, "POJ"))
print("To Traditional Chinese: " + translate(source_sentence, "ZH"))
print("To Hanlo: " + translate(source_sentence, "HL"))
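The example trims the generated text at the first `[/`, presumably because the model may continue with a closing tag after the translation. That step can be isolated into a pure function and checked without loading the model. The sketch below uses a hypothetical helper name, `extract_translation`, and placeholder strings rather than real model output.

```python
def extract_translation(generated_text: str) -> str:
    """Return the text before the first "[/" marker, stripped of whitespace.

    str.partition returns the whole string as the first element when the
    marker is absent, so output without a closing tag is kept intact.
    """
    head, _, _ = generated_text.partition("[/")
    return head.strip()


# Placeholder strings standing in for generated text.
print(extract_translation("translated text\n[/POJ]"))
print(extract_translation("translated text"))
```

Using `str.partition` avoids the `-1` returned by `str.find` when the marker is missing, which would otherwise drop the last character when used as a slice bound.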
📄 License
The model is licensed under cc-by-nc-sa-4.0.
📖 Citation
If you find the resources in the Taiwanese Hokkien LLM collection useful in your work, please cite it using the following reference:
@misc{lu2024enhancing,
  title={Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems},
  author={Bo-Han Lu and Yi-Hsuan Lin and En-Shiun Annie Lee and Richard Tzong-Han Tsai},
  year={2024},
  eprint={2403.12024},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}