KoreanLM: Korean Language Model Project
KoreanLM is an open-source project dedicated to developing a Korean language model. Most existing language models focus on English, so they are trained on relatively little Korean and tokenize Korean text inefficiently. The KoreanLM project was started to address these issues and provide a language model optimized for Korean.
Quick Start
KoreanLM is distributed via a GitHub repository. You can install the project as follows:
git clone https://github.com/quantumaikr/KoreanLM.git
cd KoreanLM
pip install -r requirements.txt
Features
1. Develop a Korean-specific language model
Develop a language model that can understand and generate Korean more accurately by reflecting the grammar, vocabulary, and cultural characteristics of Korean.
2. Introduce an efficient tokenization method
Improve the performance of the language model by introducing a new tokenization method that enables efficient and accurate analysis of Korean text (see the tokenization sketch after this list).
3. Improve the usability of large-scale language models
Adjust the size of the Korean language model so that enterprises can fine-tune it with their own data, addressing the difficulty of fine-tuning very large models and making it easier to apply the model to natural language processing tasks.
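As a quick illustration of tokenization efficiency, the snippet below compares how many tokens a Korean sentence is split into by the KoreanLM tokenizer versus a general-purpose English-centric tokenizer. This is a minimal sketch: the gpt2 tokenizer is used only as an example baseline, the sample sentence is illustrative, and the exact token counts depend on the released vocabulary.

from transformers import AutoTokenizer

# KoreanLM tokenizer versus a general-purpose English-centric tokenizer (gpt2 is
# used here only as an illustrative baseline, not as a project recommendation).
korean_tok = AutoTokenizer.from_pretrained("quantumaikr/KoreanLM")
english_tok = AutoTokenizer.from_pretrained("gpt2")

text = "한국어 자연어처리는 토큰화 방식에 따라 효율이 크게 달라집니다."

# A Korean-aware vocabulary should need noticeably fewer tokens for the same text.
print("KoreanLM tokens:", len(korean_tok.tokenize(text)))
print("gpt2 tokens:    ", len(english_tok.tokenize(text)))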
Usage Examples
Basic Usage
The following example loads the model and tokenizer with the transformers library:
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained("quantumaikr/KoreanLM")
tokenizer = transformers.AutoTokenizer.from_pretrained("quantumaikr/KoreanLM")
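Continuing the example above, a short generation call might look like this. The prompt and sampling parameters are illustrative only, not recommended settings.

prompt = "한국어 언어모델의 장점은"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a short continuation; max_new_tokens and sampling settings are examples.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))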
Technical Details
Training (Fine-Tuning)
Standard Fine-Tuning
torchrun --nproc_per_node=4 --master_port=1004 train.py \
--model_name_or_path quantumaikr/KoreanLM \
--data_path korean_data.json \
--num_train_epochs 3 \
--cache_dir './data' \
--bf16 True \
--tf32 True \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer'
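The --data_path argument above points to a JSON file of training examples. The exact schema expected by train.py is an assumption here; the sketch below produces a korean_data.json in an Alpaca-style instruction/input/output format, which this repository's Stanford Alpaca lineage suggests.

import json

# Illustrative Alpaca-style records; the schema is an assumption, not the
# project's documented format.
records = [
    {
        "instruction": "다음 문장을 영어로 번역하세요.",
        "input": "한국어 언어모델을 공부하고 있습니다.",
        "output": "I am studying Korean language models.",
    }
]

with open("korean_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)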
Fine-Tuning with DeepSpeed
pip install deepspeed
torchrun --nproc_per_node=4 --master_port=1004 train.py \
--deepspeed "./deepspeed.json" \
--model_name_or_path quantumaikr/KoreanLM \
--data_path korean_data.json \
--num_train_epochs 3 \
--cache_dir './data' \
--bf16 True \
--tf32 True \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03
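The command above reads its DeepSpeed settings from ./deepspeed.json. A minimal sketch of a ZeRO-3 configuration that defers batch-size and precision settings to the Trainer is shown below; the configuration actually shipped with the project may differ.

{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}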
Training with LoRA
python finetune-lora.py \
--base_model 'quantumaikr/KoreanLM' \
--data_path './korean_data.json' \
--output_dir './KoreanLM-LoRA' \
--cache_dir './data'
Inference
python generate.py \
--load_8bit \
--share_gradio \
--base_model 'quantumaikr/KoreanLM' \
--lora_weights 'quantumaikr/KoreanLM-LoRA' \
--cache_dir './data'
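If you prefer to run inference programmatically instead of through generate.py, a minimal sketch using the peft library is shown below. The adapter name mirrors the --lora_weights argument above; the prompt and generation settings are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("quantumaikr/KoreanLM", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "quantumaikr/KoreanLM-LoRA")
tokenizer = AutoTokenizer.from_pretrained("quantumaikr/KoreanLM")

inputs = tokenizer("한국어로 자기소개를 해주세요.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))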
Documentation
Pretrained Model Release and Web Demo
* Trained model: quantumaikr/KoreanLM
* The demo link will be released later.
How to Contribute
1. Raise an issue
Please raise issues or suggestions for improvement related to the KoreanLM project.
2. Write code
You can write code to add improvements or new features. Please submit your code via a Pull Request.
3. Write and translate documentation
Participate in writing or translating project documentation to improve the quality of the project.
4. Testing and feedback
Feedback on bugs or improvements found while using the project will be greatly appreciated.
License
The KoreanLM project is licensed under the Apache 2.0 License. Please follow the license requirements when using the project.
Technical Inquiry
If you have any questions regarding the KoreanLM project, please contact us via email or GitHub issues. We hope this project will contribute to the research and development of Korean language models and welcome your interest and participation.
Email: hi@quantumai.kr
This repository includes implementations inspired by the open_llama, Stanford Alpaca, and alpaca-lora projects.