TrOCR for Korean Language (PoC)
TrOCR has not yet released a multilingual model that includes Korean, so this project trains a Korean model as a proof of concept (PoC).
Quick Start
TrOCR has not launched a multilingual model that includes Korean. Therefore, we trained a Korean model as a proof of concept (PoC). Building on this model, we recommend either gathering more data for additional first-stage training or performing second-stage fine-tuning.
Features
- This is a Korean-specific TrOCR model for PoC.
- It uses publicly available datasets for training.
- The code for data collection and model training is open-sourced on GitHub.
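Installation
No specific installation steps are provided. The usage example below imports the transformers, requests, and Pillow packages and returns PyTorch tensors, so those packages need to be available.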
Usage Examples
Basic Usage
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoTokenizer, TrOCRProcessor, VisionEncoderDecoderModel

# The image processor comes from the original TrOCR checkpoint;
# the model and tokenizer are the Korean PoC checkpoint.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")
tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")

# Download a sample image and make sure it is in RGB mode.
url = "https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_1.jpg"
response = requests.get(url)
response.raise_for_status()
img = Image.open(BytesIO(response.content)).convert("RGB")

# Preprocess the image, generate token IDs, and decode them to text.
pixel_values = processor(img, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_length=64)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
Documentation
Collecting data
Text data
We created training data by processing three types of datasets:
- News summarization dataset: https://huggingface.co/datasets/daekeun-ml/naver-news-summarization-ko
- Naver Movie Sentiment Classification: https://github.com/e9t/nsmc
- Chatbot dataset: https://github.com/songys/Chatbot_data
To collect data efficiently, the text was split into individual sentences with a sentence-splitting library (Kiwi Python wrapper; https://github.com/bab2min/kiwipiepy), which yielded 637,401 samples.
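The sentence-level collection step can be sketched as follows. This is a minimal sketch and not the author's code: `naive_split` is a hypothetical regex stand-in for the Kiwi sentence splitter, and the `collect_sentences` helper and its deduplication step are assumptions added for illustration.

```python
import re
from typing import Callable, Iterable, List


def naive_split(text: str) -> List[str]:
    """Hypothetical stand-in for a real sentence splitter such as Kiwi.

    Splits after '.', '!' or '?' when followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def collect_sentences(
    documents: Iterable[str],
    split_fn: Callable[[str], List[str]] = naive_split,
) -> List[str]:
    """Split each document into sentences and drop exact duplicates.

    First-seen order is preserved. Deduplication is an assumption,
    not something the original text describes.
    """
    seen = set()
    sentences = []
    for doc in documents:
        for sent in split_fn(doc):
            sent = sent.strip()
            if sent and sent not in seen:
                seen.add(sent)
                sentences.append(sent)
    return sentences


def write_corpus(sentences: List[str], path: str) -> None:
    """Write one sentence per line, the layout a line-based text file would use."""
    with open(path, "w", encoding="utf-8") as f:
        for sent in sentences:
            f.write(sent + "\n")


docs = ["안녕하세요. 반갑습니다!", "반갑습니다! 영화가 재미있어요?"]
samples = collect_sentences(docs)
print(len(samples), samples)
```

With kiwipiepy installed, `split_fn` could instead wrap its sentence splitter, which handles Korean sentence boundaries far better than a regex.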
Image data
Image data was generated with TextRecognitionDataGenerator (https://github.com/Belval/TextRecognitionDataGenerator), the synthetic text image generator referenced in the TrOCR paper.
The following command generates the images:
python3 ./trdg/run.py -i ocr_dataset_poc.txt -w 5 -t {num_cores} -f 64 -l ko -c {num_samples} -na 2 --output_dir {dataset_dir}
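When the generation step is driven from a script, the placeholders in the command above can be filled in programmatically. The sketch below only builds the argument list with the flags copied verbatim from that command; the helper name `build_trdg_command` and the example values (8 cores, 1000 samples, `./ocr_dataset`) are hypothetical.

```python
import shlex
from typing import List


def build_trdg_command(
    num_cores: int,
    num_samples: int,
    dataset_dir: str,
    input_file: str = "ocr_dataset_poc.txt",
) -> List[str]:
    """Return the TRDG invocation as an argument list.

    Flags and fixed values are copied from the command shown above;
    only the three placeholders are substituted.
    """
    return [
        "python3", "./trdg/run.py",
        "-i", input_file,
        "-w", "5",
        "-t", str(num_cores),
        "-f", "64",
        "-l", "ko",
        "-c", str(num_samples),
        "-na", "2",
        "--output_dir", dataset_dir,
    ]


# Hypothetical values, for illustration only.
cmd = build_trdg_command(num_cores=8, num_samples=1000, dataset_dir="./ocr_dataset")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from a TextRecognitionDataGenerator checkout; passing a list rather than a shell string avoids quoting problems in the output path.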
Training
Base model
The encoder was initialized from facebook/deit-base-distilled-patch16-384 and the decoder from klue/roberta-base. This is easier than starting training from the microsoft/trocr-base-stage1 weights.
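One way to assemble such an encoder-decoder pair with the Hugging Face transformers library is sketched below. It is an illustration under stated assumptions, not the author's training code: the `build_model` helper is hypothetical, and the choice of the tokenizer's CLS and SEP tokens as start and end tokens is an assumption that follows common TrOCR fine-tuning practice.

```python
ENCODER_ID = "facebook/deit-base-distilled-patch16-384"
DECODER_ID = "klue/roberta-base"


def build_model(encoder_id: str = ENCODER_ID, decoder_id: str = DECODER_ID):
    """Combine a pretrained vision encoder with a pretrained text decoder."""
    # Imported lazily so the module loads even without transformers installed.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(decoder_id)
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_id, decoder_id
    )

    # The combined model needs to know which tokens start, pad, and end
    # a decoded sequence. Using CLS/SEP here is an assumption.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    model.config.eos_token_id = tokenizer.sep_token_id
    model.config.vocab_size = model.config.decoder.vocab_size
    return model, tokenizer
```

Calling `build_model()` downloads both checkpoints. The cross-attention layers that connect the decoder to the encoder are newly initialized, so the combined model has to be trained before it produces useful text.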
Parameters
We used heuristic parameters without separate hyperparameter tuning.
- learning_rate = 4e-5
- epochs = 25
- fp16 = True
- max_length = 64
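The settings above can be kept in one place and used for a rough estimate of training length. This is a sketch only: the `TrainConfig` and `total_steps` names are hypothetical, the batch size of 32 is an assumed value that the original does not state, and the estimate assumes every one of the 637,401 collected sentences becomes one training image.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed above."""
    learning_rate: float = 4e-5
    epochs: int = 25
    fp16: bool = True
    max_length: int = 64


def total_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(num_samples / batch_size) * epochs


cfg = TrainConfig()
# batch_size=32 is an assumed value, for illustration only.
steps = total_steps(num_samples=637_401, batch_size=32, epochs=cfg.epochs)
print(cfg, steps)
```

An estimate like this helps when choosing warmup length or checkpoint intervals, both of which are usually expressed in steps.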
Usage
All the code required for data collection and model training is published on the author's GitHub:
- https://github.com/daekeun-ml/sm-kornlp-usecases/tree/main/trocr
Technical Details
The Korean TrOCR model pairs a facebook/deit-base-distilled-patch16-384 encoder with a klue/roberta-base decoder. The training text comes from three publicly available datasets, and the training images are generated from that text with TextRecognitionDataGenerator. The training parameters were chosen heuristically, without separate hyperparameter tuning.
License
The project is licensed under the MIT license.
| Property | Details |
|---|---|
| Model Type | TrOCR for Korean Language (PoC) |
| Training Data | News summarization dataset, Naver Movie Sentiment Classification, Chatbot dataset |
| License | MIT |