🚀 zhtw-en
This model translates Traditional Chinese sentences into English, with a focus on the variety of Traditional Chinese used in Taiwan, to produce more accurate English translations.
🚀 Quick Start
Prerequisites
Before using this model, ensure the `transformers` library is installed:

```bash
pip install transformers
```
Usage Example
```python
from transformers import pipeline

model_checkpoint = "agentlans/zhtw-en"
translator = pipeline("translation", model=model_checkpoint)
translator("《阿奇大戰鐵血戰士》是2015年4至7月黑馬漫畫和阿奇漫畫在美國發行的四期限量連環漫畫圖書,由亞歷克斯·德坎皮創作,費爾南多·魯伊斯繪圖,屬跨公司跨界作品。")[0]['translation_text']
```
📦 Installation
The model is used through the `transformers` library, installed with:

```bash
pip install transformers
```
📚 Documentation
Intended Uses & Limitations
Intended Use Cases
- Translating single sentences from Chinese to English.
- Applications requiring understanding of the Chinese language as spoken in Taiwan.
Limitations
- Designed for single-sentence translation; longer texts should be split into sentences before translation.
- May hallucinate or omit information, especially on very short or very long inputs; further fine-tuning may mitigate this.
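Since the model expects single sentences, longer Traditional Chinese passages can be pre-processed by splitting on sentence-final punctuation. A minimal sketch (the regex and helper name are illustrative, not part of this model's API):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split Chinese text after sentence-final punctuation (。！？),
    keeping the punctuation attached to each sentence."""
    parts = re.split(r"(?<=[。！？])", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("今天天氣很好。我們去公園吧！你要來嗎？")
# Each sentence can then be passed to the translator individually.
```

Each resulting sentence can be translated on its own, and the English outputs joined afterwards.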
Training and Evaluation Data
This model was trained and evaluated on the Corpus of Contemporary Taiwanese Mandarin (COCT) translations dataset.
- Training Data: 80% of the COCT dataset
- Validation Data: 20% of the COCT dataset
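An 80/20 split like this can be reproduced along the following lines with a fixed seed (the shuffling scheme here is illustrative; the card does not specify exactly how the split was made):

```python
import random

def train_val_split(examples, val_fraction=0.2, seed=42):
    """Shuffle deterministically, then hold out the last fraction for validation."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - val_fraction))
    return items[:cut], items[cut:]

train, val = train_val_split(range(1000))
# 800 training examples, 200 validation examples
```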
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training:
- Learning Rate: 5e-05
- Train Batch Size: 8
- Eval Batch Size: 8
- Seed: 42
- Optimizer: adamw_torch with betas=(0.9,0.999) and epsilon=1e-08
- LR Scheduler Type: linear
- Number of Epochs: 3.0
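With a linear scheduler, the learning rate decays from 5e-05 toward 0 over the total number of training steps. A small sketch of that arithmetic, assuming no warmup (the total-step count below is a made-up example, not taken from this run):

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    """Linearly decay the learning rate from base_lr to 0 (no warmup assumed)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

total = 100_000
linear_lr(0, total)       # base rate at the start of training
linear_lr(50_000, total)  # half the base rate midway through
```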
Training Results
Training and validation losses:

| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|---|---|---|---|---|
| 3.2254 | 0.0804 | 2500 | 2.9105 | 1493088 |
| 3.0946 | 0.1608 | 5000 | 2.8305 | 2990968 |
| 3.0473 | 0.2412 | 7500 | 2.7737 | 4477792 |
| 2.9633 | 0.3216 | 10000 | 2.7307 | 5967560 |
| 2.9355 | 0.4020 | 12500 | 2.6843 | 7463192 |
| 2.9076 | 0.4824 | 15000 | 2.6587 | 8950264 |
| 2.8714 | 0.5628 | 17500 | 2.6304 | 10443344 |
| 2.8716 | 0.6433 | 20000 | 2.6025 | 11951096 |
| 2.7989 | 0.7237 | 22500 | 2.5822 | 13432464 |
| 2.7941 | 0.8041 | 25000 | 2.5630 | 14919424 |
| 2.7692 | 0.8845 | 27500 | 2.5497 | 16415080 |
| 2.7570 | 0.9649 | 30000 | 2.5388 | 17897832 |
| 2.7024 | 1.0453 | 32500 | 2.6006 | 19384812 |
| 2.7248 | 1.1257 | 35000 | 2.6042 | 20876844 |
| 2.6764 | 1.2061 | 37500 | 2.5923 | 22372340 |
| 2.6854 | 1.2865 | 40000 | 2.5793 | 23866100 |
| 2.6830 | 1.3669 | 42500 | 2.5722 | 25348084 |
| 2.6871 | 1.4473 | 45000 | 2.5538 | 26854100 |
| 2.6551 | 1.5277 | 47500 | 2.5443 | 28332612 |
| 2.6610 | 1.6081 | 50000 | 2.5278 | 29822156 |
| 2.6497 | 1.6885 | 52500 | 2.5266 | 31319476 |
| 2.6281 | 1.7689 | 55000 | 2.5116 | 32813220 |
| 2.6067 | 1.8494 | 57500 | 2.5047 | 34298052 |
| 2.6112 | 1.9298 | 60000 | 2.4935 | 35783604 |
| 2.5207 | 2.0102 | 62500 | 2.4946 | 37281092 |
| 2.4799 | 2.0906 | 65000 | 2.4916 | 38768588 |
| 2.4727 | 2.1710 | 67500 | 2.4866 | 40252972 |
| 2.4719 | 2.2514 | 70000 | 2.4760 | 41746300 |
| 2.4738 | 2.3318 | 72500 | 2.4713 | 43241188 |
| 2.4629 | 2.4122 | 75000 | 2.4630 | 44730244 |
| 2.4524 | 2.4926 | 77500 | 2.4575 | 46231060 |
| 2.4350 | 2.5730 | 80000 | 2.4553 | 47718964 |
| 2.4621 | 2.6534 | 82500 | 2.4475 | 49209724 |
| 2.4492 | 2.7338 | 85000 | 2.4440 | 50712980 |
| 2.4536 | 2.8142 | 87500 | 2.4394 | 52204380 |
| 2.4148 | 2.8946 | 90000 | 2.4360 | 53695620 |
| 2.4243 | 2.9750 | 92500 | 2.4350 | 55190020 |
Framework Versions
- Transformers 4.48.1
- Pytorch 2.3.0+cu121
- Datasets 3.2.0
- Tokenizers 0.21.0
📄 License
This model is licensed under CC BY 4.0.
| Property | Details |
|---|---|
| Model Type | Translation model |
| Training Data | 80% of the Corpus of Contemporary Taiwanese Mandarin (COCT) translations dataset |
| Validation Data | 20% of the Corpus of Contemporary Taiwanese Mandarin (COCT) translations dataset |