# Japanese to Korean Translator
This is a Japanese-to-Korean translator model. It is built on the `EncoderDecoderModel` architecture, combining [bert-japanese](https://huggingface.co/cl-tohoku/bert-base-japanese) and [kogpt2](https://github.com/SKT-AI/KoGPT2), and translates Japanese text into Korean.
## Quick Start

### Demo

You can visit the demo at https://huggingface.co/spaces/sappho192/aihub-ja-ko-translator-demo.
⨠Features
- **Model Architecture**: Uses the `EncoderDecoderModel` with `bert-japanese` as the encoder and `kogpt2` as the decoder.
- **Language Pair**: Specialized in Japanese-to-Korean translation.
## 📦 Installation
### Dependencies (PyPI)

- torch
- transformers
- fugashi
- unidic-lite
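Assuming a standard Python environment, the dependencies above can be installed in one step (this model card does not pin exact package versions):

```shell
pip install torch transformers fugashi unidic-lite
```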
## 💻 Usage Examples

### Basic Usage
```python
from transformers import (
    EncoderDecoderModel,
    PreTrainedTokenizerFast,
    BertJapaneseTokenizer,
)
import torch

encoder_model_name = "cl-tohoku/bert-base-japanese-v2"
decoder_model_name = "skt/kogpt2-base-v2"

src_tokenizer = BertJapaneseTokenizer.from_pretrained(encoder_model_name)
trg_tokenizer = PreTrainedTokenizerFast.from_pretrained(decoder_model_name)
model = EncoderDecoderModel.from_pretrained("sappho192/aihub-ja-ko-translator")

text = "初めまして。よろしくお願いします。"


def translate(text_src):
    # Tokenize the Japanese source text
    embeddings = src_tokenizer(text_src, return_attention_mask=False,
                               return_token_type_ids=False, return_tensors='pt')
    embeddings = {k: v for k, v in embeddings.items()}
    # Generate Korean token IDs, dropping the leading and trailing special tokens
    output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    text_trg = trg_tokenizer.decode(output.cpu())
    return text_trg


print(translate(text))
```
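One detail worth noting in `translate` above: the `[0, 1:-1]` slice selects the first (and only) sequence in the batch and strips the decoder's leading and trailing special tokens (BOS/EOS) before decoding. A minimal illustration of that slicing with plain Python lists and made-up token IDs:

```python
# generate() returns a batch of token-ID sequences; here is a toy stand-in
# with hypothetical IDs, where 1 plays the role of BOS and 2 of EOS.
generated = [[1, 503, 88, 2971, 2]]

# Mirrors model.generate(...)[0, 1:-1]: first sequence, without BOS/EOS.
trimmed = generated[0][1:-1]
print(trimmed)  # [503, 88, 2971]
```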
## Documentation

### Dataset
This model was trained on datasets from 'The Open AI Dataset Project (AI-Hub, South Korea)'. All dataset information can be accessed through 'AI-Hub (aihub.or.kr)'.
### ⚠️ Important Note
For a corporation, organization, or individual located outside of South Korea to use the AI data, a separate agreement with the performing organization and the Korea National Information Society Agency (NIA) is required. Likewise, exporting the AI data outside the country requires a separate agreement with the performing organization and the NIA. Link
### Dataset list
The dataset used to train the model is merged from the following sub-datasets:

- Everyday life and colloquial Korean-Chinese, Korean-Japanese translation parallel corpus data [Link]
- Korean-multilingual (excluding English) translation corpus (science and technology) [Link]
- Korean-multilingual translation corpus (basic science) [Link]
- Korean-multilingual translation corpus (humanities) [Link]
- Korean-Japanese translation corpus [Link]
To reproduce the merged dataset, you can use the code at https://github.com/sappho192/aihub-translation-dataset.
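Conceptually, the merge is a concatenation of Japanese–Korean sentence pairs drawn from each sub-dataset. A minimal sketch of that idea, with hypothetical file contents and a TSV layout (the linked repository contains the actual preprocessing code):

```python
import csv
import io

# Hypothetical: each sub-dataset is a TSV file of "japanese<TAB>korean" pairs.
sub_datasets = [
    "こんにちは\t안녕하세요\n",            # stand-in for one corpus file
    "ありがとうございます\t감사합니다\n",   # stand-in for another
]

merged = []
for blob in sub_datasets:
    reader = csv.reader(io.StringIO(blob), delimiter="\t")
    merged.extend(tuple(row) for row in reader)

print(len(merged))  # 2 sentence pairs in the merged corpus
```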
## License
This project is licensed under the MIT license.