ko-trocr-base-nsmc-news-chatbot: Open-Source Model for Korean Image Text Recognition

Developed by daekeun-ml

This is a proof-of-concept model for Korean text recognition, built on the TrOCR architecture, that extracts Korean text from images.

Image-to-Text

Transformers

Language: Korean

Open Source License: MIT

Tags: #Korean OCR #Image to Text #Multi-scenario Adaptation

Downloads 44

Release Time: 11/22/2022

Model Overview

This model is a Korean text recognition model based on the TrOCR architecture, specifically designed to extract Korean text from images. Since TrOCR has not yet released a multilingual model including Korean, this model was developed as a proof-of-concept. It is recommended to fine-tune the model with additional collected data.

Model Features

Korean Text Recognition

OCR trained specifically on Korean text to recognize Korean (Hangul) characters in images

Multi-domain Training Data

Trained on a mix of news summaries, movie reviews, and chatbot datasets to enhance model generalization

TrOCR Architecture

Transformer-based OCR architecture combining a vision encoder and a text decoder

Model Capabilities

Korean Text Recognition

Image to Text

Multi-domain Text Processing

Use Cases

Document Digitization

News Article Digitization

Convert images of printed Korean news text into editable text. The training images were synthetically rendered, so handwritten input would likely require additional fine-tuning.

Content Analysis

Movie Review Analysis

Extract movie review text from images for sentiment analysis

Chatbot

Chat Log Processing

Identify and process Korean chat logs from images

🚀 TrOCR for Korean Language (PoC)

TrOCR has not yet released a multilingual model that includes Korean, so this project trained a Korean model as a proof of concept.

🚀 Quick Start

TrOCR has not released a multilingual model that includes Korean, so we trained a Korean model as a proof of concept (PoC). To build on this model, it is advisable to collect more data and either continue first-stage training or perform second-stage fine-tuning.

✨ Features

This is a Korean-specific TrOCR model built as a PoC.
It is trained on publicly available datasets.
The code for data collection and model training is open-sourced on GitHub.

💻 Usage Examples

Basic Usage

from io import BytesIO

import requests
from PIL import Image
from transformers import AutoTokenizer, TrOCRProcessor, VisionEncoderDecoderModel

# The processor handles image preprocessing only; the Korean tokenizer comes from this model's repository
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")
tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/ko-trocr-base-nsmc-news-chatbot")

# Download a sample image and convert it to RGB, which the processor expects
url = "https://raw.githubusercontent.com/aws-samples/sm-kornlp/main/trocr/sample_imgs/news_1.jpg"
response = requests.get(url, timeout=30)
response.raise_for_status()
img = Image.open(BytesIO(response.content)).convert("RGB")

# Preprocess the image, generate token IDs, and decode them into text
pixel_values = processor(img, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_length=64)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

📚 Documentation

Collecting data

Text data

We created training data by processing three types of datasets:

News summarization dataset: https://huggingface.co/datasets/daekeun-ml/naver-news-summarization-ko
Naver Movie Sentiment Classification: https://github.com/e9t/nsmc
Chatbot dataset: https://github.com/songys/Chatbot_data

To build the corpus efficiently, each document was split into sentences with a sentence-splitting library (Kiwi Python wrapper; https://github.com/bab2min/kiwipiepy), yielding 637,401 samples.
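
The sentence-level collection step described above could be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: `build_corpus`, `naive_split`, `kiwi_split`, the `min_chars` filter, and the de-duplication step are hypothetical names and choices. The regex splitter is only a stand-in so the example runs without extra dependencies; with kiwipiepy installed, `kiwi_split` could be passed as `split_fn` instead.

```python
import re
from typing import Callable, Iterable, List


def naive_split(text: str) -> List[str]:
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def kiwi_split(text: str) -> List[str]:
    """Splitter backed by kiwipiepy (requires `pip install kiwipiepy`)."""
    from kiwipiepy import Kiwi  # imported lazily so the sketch runs without it

    # For a large corpus, create the Kiwi instance once and reuse it.
    return [sent.text for sent in Kiwi().split_into_sents(text)]


def build_corpus(
    documents: Iterable[str],
    split_fn: Callable[[str], List[str]] = naive_split,
    min_chars: int = 2,  # hypothetical filter, not taken from the original project
) -> List[str]:
    """Split documents into sentences and drop duplicates, keeping first-seen order."""
    seen = set()
    corpus = []
    for doc in documents:
        for sentence in split_fn(doc):
            if len(sentence) >= min_chars and sentence not in seen:
                seen.add(sentence)
                corpus.append(sentence)
    return corpus


docs = ["오늘 날씨가 좋네요. 산책 갈까요?", "산책 갈까요? 좋아요!"]
sentences = build_corpus(docs)
print(sentences)
```

Writing the resulting sentences one per line produces a text file of the kind that the image-generation step takes as input.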

Image Data

Image data was generated with TextRecognitionDataGenerator (https://github.com/Belval/TextRecognitionDataGenerator) introduced in the TrOCR paper. Below is a code snippet for generating images.

python3 ./trdg/run.py -i ocr_dataset_poc.txt -w 5 -t {num_cores} -f 64 -l ko -c {num_samples} -na 2 --output_dir {dataset_dir}
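
To make the flags in the command above easier to follow, the sketch below assembles the same argument list in Python. The flag descriptions in the comments are a reading of the TextRecognitionDataGenerator command-line help and should be checked against `run.py --help`; `build_trdg_command` and the sample values passed to it are hypothetical.

```python
import shlex
from typing import List


def build_trdg_command(
    num_cores: int,
    num_samples: int,
    dataset_dir: str,
    text_file: str = "ocr_dataset_poc.txt",
) -> List[str]:
    """Assemble the TRDG invocation shown above as an argv list."""
    return [
        "python3", "./trdg/run.py",
        "-i", text_file,         # input text file with the source sentences
        "-w", "5",               # word count per sample, as documented by TRDG
        "-t", str(num_cores),    # number of worker threads
        "-f", "64",              # image height in pixels for horizontal text
        "-l", "ko",              # language, used to select Korean fonts
        "-c", str(num_samples),  # number of images to generate
        "-na", "2",              # naming scheme: numbered files plus a labels file
        "--output_dir", dataset_dir,
    ]


cmd = build_trdg_command(num_cores=8, num_samples=1000, dataset_dir="./ocr_dataset")
print(shlex.join(cmd))
# To run it from a TRDG checkout: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues if the dataset directory contains spaces.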

Training

Base model

The encoder was initialized from facebook/deit-base-distilled-patch16-384 and the decoder from klue/roberta-base. Starting from these checkpoints is easier than starting from the microsoft/trocr-base-stage1 weights.
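
One way to combine these two checkpoints with the transformers library is `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`. The sketch below is an assumption about how such a model could be initialized, not the author's exact training code. The special-token settings follow the usual pattern for BERT-style tokenizers, and `build_model` is a hypothetical helper. Calling it downloads both checkpoints.

```python
ENCODER_ID = "facebook/deit-base-distilled-patch16-384"
DECODER_ID = "klue/roberta-base"


def build_model(encoder_id: str = ENCODER_ID, decoder_id: str = DECODER_ID):
    """Create an untrained encoder-decoder OCR model from two pretrained checkpoints."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(decoder_id)
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_id, decoder_id
    )

    # Tell the combined model which tokens start, pad, and end a sequence
    # (assumed mapping for a BERT-style tokenizer with [CLS]/[PAD]/[SEP]).
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    model.config.eos_token_id = tokenizer.sep_token_id
    model.config.vocab_size = model.config.decoder.vocab_size
    return model, tokenizer


# Usage (downloads weights on first call):
# model, tokenizer = build_model()
```

The cross-attention layers that connect the decoder to the encoder are newly initialized in this setup, so the model needs training before it produces useful text.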

Parameters

We used heuristic parameters without separate hyperparameter tuning.

learning_rate = 4e-5
epochs = 25
fp16 = True
max_length = 64
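
As a rough illustration, the values above could be passed to the transformers `Seq2SeqTrainingArguments` class as shown below. This mapping is an assumption rather than the author's configuration: in particular, treating `max_length` as the generation length is a guess, and `make_training_args` and the output directory are hypothetical. Note that `fp16=True` generally requires a CUDA-capable GPU.

```python
# Hyperparameters listed above, renamed to Seq2SeqTrainingArguments fields.
HYPERPARAMS = {
    "learning_rate": 4e-5,
    "num_train_epochs": 25,
    "fp16": True,
    "generation_max_length": 64,  # assumed meaning of max_length = 64
}


def make_training_args(output_dir: str = "./ko-trocr-output"):
    """Build training arguments from HYPERPARAMS (output_dir is a placeholder)."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,
        predict_with_generate=True,  # decode with generate() during evaluation
        **HYPERPARAMS,
    )


# Usage on a machine with a GPU:
# training_args = make_training_args()
```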

Usage

All the code required for data collection and model training has been published on the author's GitHub:

https://github.com/daekeun-ml/sm-kornlp-usecases/tree/main/trocr

🔧 Technical Details

The model pairs a facebook/deit-base-distilled-patch16-384 encoder with a klue/roberta-base decoder. The training text comes from three publicly available datasets, and the training images are rendered from that text with TextRecognitionDataGenerator. The training parameters were chosen heuristically, without separate hyperparameter tuning.

📄 License

The project is licensed under the MIT license.

Property	Details
Model Type	TrOCR for Korean Language (PoC)
Training Data	News summarization dataset, Naver Movie Sentiment Classification, Chatbot dataset
License	MIT
