Exaone3 Instructrans V2 Enko 7.8b
An English-to-Korean translation model fine-tuned from exaone-3-7.8B-it, focused on translating instruction datasets
Machine Translation
Transformers # Supports Multiple Languages # English-Korean Translation # Instruction Fine-tuning # Large Language Model

Downloads 45
Release Time: 8/24/2024
Model Overview
This model is optimized for English-to-Korean translation through instruction fine-tuning. It was trained on multiple English-Korean translation datasets and supports high-quality text generation and translation.
Model Features
High-quality English-Korean Translation
Trained on multiple professional translation datasets, providing accurate English-Korean bidirectional translation capabilities
Instruction Optimization
Specially optimized for instruction-based translation tasks, capable of understanding and executing complex translation instructions
Large Context Support
Supports context windows of up to 8192 tokens, suitable for long-text translation
Model Capabilities
English to Korean Translation
Korean to English Translation
Text Generation
Instruction Understanding
Use Cases
Professional Translation
Technical Document Translation
Accurately translate English technical documents into Korean
Excellent performance in translating technical terminology
News Translation
Translate English news reports into Korean
Localizes content while preserving the original meaning and style
Educational Applications
Language Learning Assistance
Provide high-quality translation references for language learners
Helps learners understand complex expressions
instructTrans-v2
instructTrans-v2 is a powerful model for English-to-Korean translation, trained on multiple high-quality translation datasets.
Quick Start
Generating Text
This model supports translation from English to Korean. To translate text, use the following Python code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Translation-EnKo/exaone3-instrucTrans-v2-enko-7.8b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

system_prompt = "당신은 번역기 입니다. 영어를 한국어로 번역하세요."  # "You are a translator. Translate English into Korean."
sentence = "The aerospace industry is a flower in the field of technology and science."

conversation = [{'role': 'system', 'content': system_prompt},
                {'role': 'user', 'content': sentence}]

inputs = tokenizer.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors='pt'
).to("cuda")

outputs = model.generate(inputs, max_new_tokens=4096)  # the model was fine-tuned with sequence length 8192
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
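For translating more than one sentence, it can be convenient to wrap the steps above in a small helper. The function below is not part of the original card; it is a minimal sketch that reuses the `model` and `tokenizer` loaded above together with the documented system prompt, with greedy decoding and a shorter `max_new_tokens` chosen purely as illustrative defaults.

# Minimal helper sketch (assumes `model` and `tokenizer` from the snippet above).
def translate_en_to_ko(text, max_new_tokens=512):
    conversation = [
        {"role": "system", "content": "당신은 번역기 입니다. 영어를 한국어로 번역하세요."},
        {"role": "user", "content": text},
    ]
    inputs = tokenizer.apply_chat_template(
        conversation,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt tokens and decode only the newly generated translation.
    return tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

print(translate_en_to_ko("Large language models can follow detailed instructions."))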
Inference with vLLM
# Requires at least a 24 GB VRAM GPU. With 12 GB of VRAM, you will need to run in FP8 mode.
python vllm_inference.py -gpu_id 0 -split_idx 0 -split_num 2 -dname "nvidia/HelpSteer" -untrans_col 'helpfulness' 'correctness' 'coherence' 'complexity' 'verbosity' > 0.out
python vllm_inference.py -gpu_id 1 -split_idx 1 -split_num 2 -dname "nvidia/HelpSteer" -untrans_col 'helpfulness' 'correctness' 'coherence' 'complexity' 'verbosity' > 1.out
import os
import argparse
import pandas as pd
from tqdm import tqdm
from typing import List, Dict
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams


# Truncate samples longer than 4096 tokens (keeps dataset rows a comparable size).
def truncation_func(sample, column_name):
    input_ids = tokenizer(str(sample[column_name]), truncation=True, max_length=4096, add_special_tokens=False).input_ids
    output = tokenizer.decode(input_ids)
    sample[column_name] = output
    return sample


# Convert each column value into the model's chat template.
def create_conversation(sample, column_name):
    SYSTEM_PROMPT = "당신은 번역기 입니다. 영어 문장을 한국어로 번역하세요."  # "You are a translator. Translate English sentences into Korean."
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": sample[column_name]}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    sample[column_name] = text
    return sample


def load_dataset_preprocess(dataset_name: str, untranslate_column: List, split_num, split_idx, subset=None, num_proc=128) -> Dataset:
    step = 100 // split_num  # split the dataset into percentage slices
    if subset:
        dataset = load_dataset(dataset_name, subset, split=f'train[{step*split_idx}%:{step*(split_idx+1)}%]')
    else:
        dataset = load_dataset(dataset_name, split=f'train[{step*split_idx}%:{step*(split_idx+1)}%]')
    print(dataset)

    original_dataset = dataset  # keep untranslated columns as-is
    dataset = dataset.remove_columns(untranslate_column)
    for feature in dataset.features:
        dataset = dataset.map(lambda x: truncation_func(x, feature), num_proc=num_proc)
        dataset = dataset.map(lambda x: create_conversation(x, feature), batched=False, num_proc=num_proc)
    print("filtered_dataset:", dataset)
    return dataset, original_dataset


def save_dataset(result_dict: Dict, dataset_name, untranslate_column: List, split_idx, subset: str):
    for column in untranslate_column:
        result_dict[column] = original_dataset[column]
    df = pd.DataFrame(result_dict)
    output_file_name = dataset_name.split('/')[-1]
    os.makedirs('gen', exist_ok=True)
    if subset:
        save_path = f"gen/{output_file_name}_{subset}_{split_idx}.jsonl"
    else:
        save_path = f"gen/{output_file_name}_{split_idx}.jsonl"
    df.to_json(save_path, lines=True, orient='records', force_ascii=False)


if __name__ == "__main__":
    model_name = "Translation-EnKo/exaone3-instrucTrans-v2-enko-7.8b"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    parser = argparse.ArgumentParser(description='load dataset name & split size')
    parser.add_argument('-dname', type=str, default="Magpie-Align/Magpie-Pro-MT-300K-v0.1")
    parser.add_argument('-untrans_col', nargs='+', default=[])
    parser.add_argument('-split_num', type=int, default=4)
    parser.add_argument('-split_idx', type=int, default=0)
    parser.add_argument('-gpu_id', type=int, default=0)
    parser.add_argument('-subset', type=str, default=None)
    parser.add_argument('-num_proc', type=int, default=128)
    args = parser.parse_args()

    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_id)
    dataset, original_dataset = load_dataset_preprocess(args.dname,
                                                        args.untrans_col,
                                                        args.split_num,
                                                        args.split_idx,
                                                        args.subset,
                                                        args.num_proc
                                                        )
    # define model
    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=8192,
    )
    llm = LLM(
        model=model_name,
        tensor_parallel_size=1,
        gpu_memory_utilization=0.95,
    )

    # run inference column by column, saving results after each column
    result_dict = {}
    for feature in tqdm(dataset.features):
        print(f"'{feature}' column in progress..")
        outputs = llm.generate(dataset[feature], sampling_params)
        result_dict[feature] = [output.outputs[0].text for output in outputs]
        save_dataset(result_dict, args.dname, args.untrans_col, args.split_idx, args.subset)
        print(f"saved to json. column: {feature}")
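The two commands above write one JSONL shard per split (for the HelpSteer example, gen/HelpSteer_0.jsonl and gen/HelpSteer_1.jsonl, following the naming in save_dataset). A minimal sketch for merging the shards back into a single file is shown below; the shard paths are assumptions based on that naming scheme. If you are limited to 12 GB of VRAM, recent vLLM releases also accept quantization="fp8" in the LLM(...) constructor, which is one way to obtain the FP8 mode mentioned in the comment above.

# Merge the per-split shards written by vllm_inference.py into a single JSONL file.
# The shard paths below follow the save_dataset() naming scheme for the HelpSteer
# example above; adjust them for your own dataset name and split count.
import glob
import pandas as pd

shards = sorted(glob.glob("gen/HelpSteer_*.jsonl"))
merged = pd.concat([pd.read_json(path, lines=True) for path in shards], ignore_index=True)
merged.to_json("gen/HelpSteer_merged.jsonl", lines=True, orient="records", force_ascii=False)
print(f"merged {len(shards)} shards into {len(merged)} rows")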
Features
The exaone3-instrucTrans-v2-enko-7.8b model was trained on English -> Korean translation datasets on top of exaone-3-7.8B-it, so that it can translate English instruction datasets into Korean (a short loading sketch follows the list). The training datasets include:
- nayohan/aihub-en-ko-translation-12m
- nayohan/instruction_en_ko_translation_1.4m
- Translation-EnKo/trc_uniform_313k_eval_45_filtered
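Each of these corpora can be pulled from the Hugging Face Hub for inspection. The snippet below is only an illustrative sketch: it assumes the chosen dataset is accessible to you, and it simply prints the available splits and one example, since column names differ between the three datasets.

# Illustrative sketch: peek at one of the training corpora listed above.
from datasets import load_dataset

ds = load_dataset("nayohan/instruction_en_ko_translation_1.4m")  # any of the listed datasets
print(ds)                           # available splits and row counts
first_split = next(iter(ds.values()))
print(first_split[0])               # one example; column names vary per dataset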
Documentation
Result
# EVAL_RESULT (2405_KO_NEWS) (max_new_tokens=512)
"en_ref":"This controversy arose around a new advertisement for the latest iPad Pro that Apple released on YouTube on the 7th. The ad shows musical instruments, statues, cameras, and paints being crushed in a press, followed by the appearance of the iPad Pro in their place. It appears to emphasize the new iPad Pro's artificial intelligence features, advanced display, performance, and thickness. Apple mentioned that the newly unveiled iPad Pro is equipped with the latest 'M4' chip and is the thinnest device in Apple's history. The ad faced immediate backlash upon release, as it graphically depicts objects symbolizing creators being crushed. Critics argue that the imagery could be interpreted as technology trampling on human creators. Some have also voiced concerns that it evokes a situation where creators are losing ground due to AI."
"ko_ref":"์ด๋ฒ ๋
ผ๋์ ์ ํ์ด ์ง๋ 7์ผ ์ ํ๋ธ์ ๊ณต๊ฐํ ์ ํ ์์ดํจ๋ ํ๋ก ๊ด๊ณ ๋ฅผ ๋๋ฌ์ธ๊ณ ๋ถ๊ฑฐ์ก๋ค. ํด๋น ๊ด๊ณ ์์์ ์
๊ธฐ์ ์กฐ๊ฐ์, ์นด๋ฉ๋ผ, ๋ฌผ๊ฐ ๋ฑ์ ์์ฐฉ๊ธฐ๋ก ์ง๋๋ฅธ ๋ค ๊ทธ ์๋ฆฌ์ ์์ดํจ๋ ํ๋ก๋ฅผ ๋ฑ์ฅ์ํค๋ ๋ด์ฉ์ด์๋ค. ์ ํ ์์ดํจ๋ ํ๋ก์ ์ธ๊ณต์ง๋ฅ ๊ธฐ๋ฅ๋ค๊ณผ ์งํ๋ ๋์คํ๋ ์ด์ ์ฑ๋ฅ, ๋๊ป ๋ฑ์ ๊ฐ์กฐํ๊ธฐ ์ํ ์ทจ์ง๋ก ํ์ด๋๋ค. ์ ํ์ ์ด๋ฒ์ ๊ณต๊ฐํ ์์ดํจ๋ ํ๋ก์ ์ ํ โM4โ ์นฉ์ด ํ์ฌ๋๋ฉฐ ๋๊ป๋ ์ ํ์ ์ญ๋ ์ ํ ์ค ๊ฐ์ฅ ์๋ค๋ ์ค๋ช
๋ ๋ง๋ถ์๋ค. ๊ด๊ณ ๋ ๊ณต๊ฐ ์งํ ๊ฑฐ์ผ ๋นํ์ ์ง๋ฉดํ๋ค. ์ฐฝ์์๋ฅผ ์์งํ๋ ๋ฌผ๊ฑด์ด ์ง๋๋ ค์ง๋ ๊ณผ์ ์ ์ง๋์น๊ฒ ์ ๋๋ผํ๊ฒ ๋ฌ์ฌํ ์ ์ด ๋ฌธ์ ๊ฐ ๋๋ค. ๊ธฐ์ ์ด ์ธ๊ฐ ์ฐฝ์์๋ฅผ ์ง๋ฐ๋ ๋ชจ์ต์ ๋ฌ์ฌํ ๊ฒ์ผ๋ก ํด์๋ ์ฌ์ง๊ฐ ์๋ค๋ ๋ฌธ์ ์์์ด๋ค. ์ธ๊ณต์ง๋ฅ(AI)์ผ๋ก ์ธํด ์ฐฝ์์๊ฐ ์ค ์๋ฆฌ๊ฐ ์ค์ด๋๋ ์ํฉ์ ์ฐ์์ํจ๋ค๋ ๋ชฉ์๋ฆฌ๋ ๋์๋ค."
"exaone3-InstrucTrans-v2":"์ด๋ฒ ๋
ผ๋์ ์ ํ์ด ์ง๋ 7์ผ ์ ํ๋ธ์ ๊ณต๊ฐํ ์ต์ ํ ์์ดํจ๋ ํ๋ก์ ์ ๊ด๊ณ ๋ฅผ ๋๋ฌ์ธ๊ณ ๋ถ๊ฑฐ์ก๋ค. ์ด ๊ด๊ณ ๋ ์
๊ธฐ, ์กฐ๊ฐ์, ์นด๋ฉ๋ผ, ๋ฌผ๊ฐ ๋ฑ์ ์์ฐฉ๊ธฐ๋ก ์ง๋๋ฅธ ๋ค ๊ทธ ์๋ฆฌ์ ์์ดํจ๋ ํ๋ก๋ฅผ ๋ฑ์ฅ์ํค๋ ์ฅ๋ฉด์ ๋ณด์ฌ์ค๋ค. ์๋ก์ด ์์ดํจ๋ ํ๋ก์ ์ธ๊ณต์ง๋ฅ ๊ธฐ๋ฅ, ์ฒจ๋จ ๋์คํ๋ ์ด, ์ฑ๋ฅ, ๋๊ป๋ฅผ ๊ฐ์กฐํ๋ ๊ฒ์ผ๋ก ๋ณด์ธ๋ค. ์ ํ์ ์ด๋ฒ์ ๊ณต๊ฐ๋ ์์ดํจ๋ ํ๋ก์ ์ต์ 'M4' ์นฉ์ด ํ์ฌ๋์ผ๋ฉฐ, ์ ํ ์ญ์ฌ์ ๊ฐ์ฅ ์์ ๋๊ป๋ฅผ ์๋ํ๋ค๊ณ ์ธ๊ธํ๋ค. ์ด ๊ด๊ณ ๋ ๊ณต๊ฐ๋์๋ง์ ํฌ๋ฆฌ์์ดํฐ๋ฅผ ์์งํ๋ ์ฌ๋ฌผ๋ค์ด ์ง๋ฐํ๋ ์ฅ๋ฉด์ ๊ทธ๋ํฝ์ผ๋ก ํํํด ์ฆ๊ฐ์ ์ธ ๋ฐ๋ฐ์ ๋ถ๋ชํ๋ค. ๋นํ๊ฐ๋ค์ ์ด ์ด๋ฏธ์ง๊ฐ ๊ธฐ์ ์ด ์ธ๊ฐ ํฌ๋ฆฌ์์ดํฐ๋ฅผ ์ง๋ฐ๋ ๊ฒ์ผ๋ก ํด์๋ ์ ์๋ค๊ณ ์ฃผ์ฅํ๋ค. ์ผ๋ถ์์๋ AI๋ก ์ธํด ํฌ๋ฆฌ์์ดํฐ๋ค์ด ์ค ์๋ฆฌ๋ฅผ ์๋ ์ํฉ์ ์ฐ์์ํจ๋ค๋ ์ฐ๋ ค์ ๋ชฉ์๋ฆฌ๋ ๋์๋ค."
"llama3-InstrucTrans":"์ด๋ฒ ๋
ผ๋์ ์ ํ์ด ์ง๋ 7์ผ ์ ํ๋ธ์ ๊ณต๊ฐํ ์ต์ ์์ดํจ๋ ํ๋ก ๊ด๊ณ ๋ฅผ ์ค์ฌ์ผ๋ก ๋ถ๊ฑฐ์ก๋ค. ์ด ๊ด๊ณ ๋ ์
๊ธฐ, ์กฐ๊ฐ์, ์นด๋ฉ๋ผ, ๋ฌผ๊ฐ ๋ฑ์ ๋๋ฅด๊ธฐ ์์ํ๋ ์ฅ๋ฉด๊ณผ ํจ๊ป ๊ทธ ์๋ฆฌ์ ์์ดํจ๋ ํ๋ก๊ฐ ๋ฑ์ฅํ๋ ์ฅ๋ฉด์ ๋ณด์ฌ์ค๋ค. ์ด๋ ์๋ก์ด ์์ดํจ๋ ํ๋ก์ ์ธ๊ณต์ง๋ฅ ๊ธฐ๋ฅ, ๊ณ ๊ธ ๋์คํ๋ ์ด, ์ฑ๋ฅ, ๋๊ป๋ฅผ ๊ฐ์กฐํ๋ ๊ฒ์ผ๋ก ๋ณด์ธ๋ค. ์ ํ์ ์ด๋ฒ์ ๊ณต๊ฐํ ์์ดํจ๋ ํ๋ก์ ์ต์ 'M4' ์นฉ์ด ํ์ฌ๋์ผ๋ฉฐ, ์ ํ ์ญ์ฌ์ ๊ฐ์ฅ ์์ ๊ธฐ๊ธฐ๋ผ๊ณ ์ธ๊ธํ๋ค. ์ด ๊ด๊ณ ๋ ์ถ์ํ์๋ง์ ํฌ๋ฆฌ์์ดํฐ๋ฅผ ์์งํ๋ ๋ฌผ๊ฑด์ด ํ์๋๋ ์ฅ๋ฉด์ด ๊ทธ๋๋ก ๊ทธ๋ ค์ ธ ๋
ผ๋์ด ๋๊ณ ์๋ค. ๋นํ๊ฐ๋ค์ ์ด ์ด๋ฏธ์ง๊ฐ ๊ธฐ์ ์ด ์ธ๊ฐ ํฌ๋ฆฌ์์ดํฐ๋ฅผ ์ง๋ฐ๋๋ค๋ ์๋ฏธ๋ก ํด์๋ ์ ์๋ค๊ณ ์ฃผ์ฅํ๋ค. ๋ํ AI๋ก ์ธํด ํฌ๋ฆฌ์์ดํฐ๋ค์ด ๋ฐ๋ฆฌ๊ณ ์๋ค๋ ์ํฉ์ ์ฐ์์ํจ๋ค๋ ์ฐ๋ ค์ ๋ชฉ์๋ฆฌ๋ ๋์จ๋ค."
Evaluation Result
The following datasets were selected to evaluate English-to-Korean translation performance.
Evaluation Dataset Sources
- Aihub/FLoRes: traintogpb/aihub-flores-koen-integrated-sparta-30k | (test set 1k)
- iwslt-2023: shreevigneshs/iwslt-2023-en-ko-train-val-split-0.1 | (f_test 597, if_test 597)
- ko_news_2024: nayohan/ko_news_eval40 | (40)
Model Evaluation Method
- In this evaluation, inference was performed with vLLM, unlike the previous (Hugging Face Transformers) evaluation. (Common setting: max_new_tokens = 512)
- The detailed evaluation setup follows that of the original instrucTrans evaluation. [Link]
Average
Using vLLM resulted in lower overall scores compared to HF.
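The tables below do not restate the metric; the earlier instrucTrans evaluation this card follows reported sacrebleu BLEU scores, so a scoring sketch under that assumption would look roughly like this (the file names, and the choice of BLEU itself, are assumptions rather than something stated in this card):

# Scoring sketch under the assumption that the reported numbers are corpus BLEU
# computed with sacrebleu; hypotheses.txt / references.txt are placeholder names.
import sacrebleu

with open("hypotheses.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# sacrebleu takes a list of reference streams; here each sentence has one reference.
# A Korean-aware tokenizer (e.g. tokenize="ko-mecab", if mecab-ko is installed)
# may match the original setup more closely than the default.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.4f}")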
Performance Comparison by Model
| Model Name | AIHub | Flores | IWSLT | News | Average |
|---|---|---|---|---|---|
| Meta-Llama | | | | | |
| meta-llama/Meta-Llama-3-8B-Instruct | 0.3075 | 0.295 | 2.395 | 0.17 | 0.7919 |
| nayohan/llama3-8b-it-translation-general-en-ko-1sent | 15.7875 | 8.09 | 4.445 | 4.68 | 8.2506 |
| nayohan/llama3-instrucTrans-enko-8b | 16.3938 | 9.63 | 5.405 | 5.3225 | 9.1878 |
| nayohan/llama3-8b-it-general-trc313k-enko-8k | 14.7225 | 10.47 | 4.45 | 7.555 | 9.2994 |
| Gemma | | | | | |
| Translation-EnKo/gemma-2-2b-it-general1.2m-trc313eval45 | 13.7775 | 7.88 | 3.95 | 6.105 | 7.9281 |
| Translation-EnKo/gemma-2-9b-it-general1.2m-trc313eval45 | 18.9887 | 13.215 | 6.28 | 9.975 | 12.1147 |
| Translation-EnKo/gukbap-gemma-2-9b-it-general1.2m-trc313eval45 | 18.405 | 12.44 | 6.59 | 9.64 | 11.7688 |
| EXAONE | | | | | |
| CarrotAI/EXAONE-3.0-7.8B-Instruct-Llamafied-8k | 4.9375 | 4.9 | 1.58 | 8.215 | 4.9081 |
| Translation-EnKo/exaeon3-translation-general-enko-7.8b (private) | 17.8275 | 8.56 | 2.72 | 6.31 | 8.8544 |
| Translation-EnKo/exaone3-instrucTrans-v2-enko-7.8b | 19.6075 | 13.46 | 7.28 | 11.4425 | 12.9475 |
Performance Analysis by Training Dataset
| Model Name | AIHub | Flores | IWSLT | News | Average |
|---|---|---|---|---|---|
| Meta-Llama | | | | | |
| Meta-Llama-3-8B-Instruct | 0.3075 | 0.295 | 2.395 | 0.17 | 0.7919 |
| llama3-8b-it-general1.2m-en-ko-4k | 15.7875 | 8.09 | 4.445 | 4.68 | 8.2506 |
| llama3-8b-it-general1.2m-trc313k-enko-4k | 16.3938 | 9.63 | 5.405 | 5.3225 | 9.1878 |
| llama3-8b-it-general1.2m-trc313k-enko-8k | 14.7225 | 10.47 | 4.45 | 7.555 | 9.2994 |
| Gemma | | | | | |