Japanese to Korean translator for FFXIV
This project provides a Japanese to Korean translator specifically designed for FFXIV. It utilizes transformer models and offers both PyTorch and ONNX-based inference methods.
Quick Start
The project is described in detail in its GitHub repo.
Features
- Translation Pipeline: Specialized for translating Japanese text to Korean in the context of FFXIV.
- Multiple Inference Methods: Supports both PyTorch and Optimum.OnnxRuntime for inference.
- Training Notebook: A training notebook is provided for further model customization.
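Installation
No dedicated installation steps are provided. The usage examples below import transformers and torch, and the ONNX example additionally imports optimum.onnxruntime and onnxruntime, so those packages need to be available.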
Usage Examples
Basic Usage
Inference (PyTorch)
from transformers import (
    EncoderDecoderModel,
    PreTrainedTokenizerFast,
    BertJapaneseTokenizer,
)
import torch

# The tokenizers come from the encoder and decoder base models.
encoder_model_name = "cl-tohoku/bert-base-japanese-v2"
decoder_model_name = "skt/kogpt2-base-v2"

src_tokenizer = BertJapaneseTokenizer.from_pretrained(encoder_model_name)
trg_tokenizer = PreTrainedTokenizerFast.from_pretrained(decoder_model_name)
# Load the fine-tuned encoder-decoder checkpoint from a local directory.
model = EncoderDecoderModel.from_pretrained("./best_model")

text = "ギルガメッシュ討伐戦"

def translate(text_src):
    # Tokenize the Japanese source; only input_ids are passed to the model.
    embeddings = src_tokenizer(text_src, return_attention_mask=False, return_token_type_ids=False, return_tensors='pt')
    embeddings = {k: v for k, v in embeddings.items()}
    # Generate, take the first sequence, and drop its first and last tokens.
    output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    text_trg = trg_tokenizer.decode(output.cpu())
    return text_trg

print(translate(text))
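The slice `[0, 1:-1]` applied to the output of `model.generate` is easy to misread, so here is a minimal sketch of what it selects, written with plain Python lists instead of tensors. The helper name `strip_boundary_tokens` and the token ids are made up for illustration. The assumption that the first and last ids are start and end markers is an inference from the code above, not something the README states.

```python
from typing import List, Sequence


def strip_boundary_tokens(batch: Sequence[Sequence[int]]) -> List[int]:
    """Return the first sequence of a batch without its first and last token id.

    Plain-list equivalent of the tensor slice `[0, 1:-1]`.
    """
    first_sequence = batch[0]          # `[0, ...]` selects the first batch item
    return list(first_sequence[1:-1])  # `1:-1` drops the first and last ids


# Made-up ids for illustration: 0 stands in for a start token, 1 for an end token.
generated = [[0, 9021, 7788, 1]]
print(strip_boundary_tokens(generated))  # [9021, 7788]
```

One consequence of this slicing: if generation stops because `max_length` is reached rather than because an end token was produced, the last id that gets dropped would be real content, not a marker.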
Inference (Optimum.OnnxRuntime)
Note that Optimum.OnnxRuntime currently still requires PyTorch as a backend. [Issue]
You can use either the [ONNX] or the [quantized ONNX] model.
from transformers import BertJapaneseTokenizer, PreTrainedTokenizerFast
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from onnxruntime import SessionOptions
import torch

encoder_model_name = "cl-tohoku/bert-base-japanese-v2"
decoder_model_name = "skt/kogpt2-base-v2"

src_tokenizer = BertJapaneseTokenizer.from_pretrained(encoder_model_name)
trg_tokenizer = PreTrainedTokenizerFast.from_pretrained(decoder_model_name)

# Severity level 3 limits ONNX Runtime logging to errors and fatal messages.
sess_options = SessionOptions()
sess_options.log_severity_level = 3
model = ORTModelForSeq2SeqLM.from_pretrained("sappho192/ffxiv-ja-ko-translator",
                                             sess_options=sess_options, subfolder="onnx")

texts = [
    "逃げろ!",
    "初めまして.",
    "よろしくお願いします.",
    "ギルガメッシュ討伐戦",
    "ギルガメッシュ討伐戦に行ってきます。一緒に行きましょうか？",
    "夜になりました",
    "ご飯を食べましょう."
]

def translate(text_src):
    # Tokenize the Japanese source; only input_ids are passed to the model.
    embeddings = src_tokenizer(text_src, return_attention_mask=False, return_token_type_ids=False, return_tensors='pt')
    print(f'Src tokens: {embeddings.data["input_ids"]}')
    embeddings = {k: v for k, v in embeddings.items()}
    # Generate, take the first sequence, and drop its first and last tokens.
    output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    print(f'Trg tokens: {output}')
    text_trg = trg_tokenizer.decode(output.cpu())
    return text_trg

for text in texts:
    print(translate(text))
    print()
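The loop above prints each translation as it goes. When the results need to be kept, for example to compare source and target side by side, a small wrapper can collect them instead. The following is only a sketch: `translate_all` and `fake_translate` are hypothetical names that are not part of the project, and the stand-in translator exists solely so the snippet runs without downloading a model. In practice the `translate` function defined above would be passed in.

```python
from typing import Callable, Iterable, List, Tuple


def translate_all(
    texts: Iterable[str],
    translate_fn: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Translate each text and return (source, translation) pairs in input order."""
    results = []
    for text in texts:
        stripped = text.strip()
        if not stripped:  # skip empty or whitespace-only entries
            continue
        results.append((stripped, translate_fn(stripped)))
    return results


# Stand-in translator (hypothetical) so the sketch runs offline.
def fake_translate(text_src: str) -> str:
    return f"<ko:{text_src}>"


pairs = translate_all(["夜になりました", "   ", "初めまして."], fake_translate)
for src, trg in pairs:
    print(f"{src} -> {trg}")
```

Passing the translator as an argument keeps the helper independent of whether the PyTorch or the ONNX model is loaded.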
Advanced Usage
Training
See the training.ipynb notebook.
Documentation
Demo
Click to try demo
Check this Windows app demo with ONNX model
License
This project is licensed under the MIT license.
| Property | Details |
|---|---|
| Model Type | Transformer-based encoder-decoder model |
| Training Data | Helsinki-NLP/tatoeba_mt, sappho192/Tatoeba-Challenge-jpn-kor |
| Languages | Japanese, Korean |
| Pipeline Tag | Translation |
| Tags | python, transformer, pytorch |
| Inference | false |
Important Note
FINAL FANTASY is a registered trademark of Square Enix Holdings Co., Ltd.