KoreanLM: Korean Language Model Project
KoreanLM is an open-source project dedicated to developing a Korean language model. Most existing language models focus on English, so they are trained on relatively little Korean and tokenize Korean text inefficiently. The KoreanLM project was started to address these issues and provide a language model optimized for Korean.
Quick Start
KoreanLM is distributed via a GitHub repository. You can install the project as follows:
git clone https://github.com/quantumaikr/KoreanLM.git
cd KoreanLM
pip install -r requirements.txt
Features
1. Develop a Korean-specific language model
Develop a language model that can understand and generate Korean more accurately by reflecting the grammar, vocabulary, and cultural characteristics of Korean.
2. Introduce an efficient tokenization method
Improve the performance of the language model by introducing a new tokenization method that enables efficient and accurate analysis of Korean text (see the tokenization sketch after this list).
3. Improve the usability of large-scale language models
Adjust the size of the Korean language model so that enterprises can fine-tune it with their own data, addressing the difficulty of fine-tuning very large models and making it easier to apply the model to natural language processing tasks.
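As a quick illustration of tokenization efficiency, the snippet below compares how many tokens a Korean sentence is split into by the KoreanLM tokenizer versus a general-purpose English-centric tokenizer. This is a minimal sketch: the gpt2 tokenizer is used only as an example baseline, the sample sentence is illustrative, and the exact token counts depend on the released vocabulary.

from transformers import AutoTokenizer

# KoreanLM tokenizer versus a general-purpose English-centric tokenizer (gpt2 is
# used here only as an illustrative baseline, not as a project recommendation).
korean_tok = AutoTokenizer.from_pretrained("quantumaikr/KoreanLM")
english_tok = AutoTokenizer.from_pretrained("gpt2")

text = "한국어 자연어처리는 토큰화 방식에 따라 효율이 크게 달라집니다."

# A Korean-aware vocabulary should need noticeably fewer tokens for the same text.
print("KoreanLM tokens:", len(korean_tok.tokenize(text)))
print("gpt2 tokens:    ", len(english_tok.tokenize(text)))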
Usage Examples
Basic Usage
The following example loads the model and tokenizer with the transformers library:
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained("quantumaikr/KoreanLM")
tokenizer = transformers.AutoTokenizer.from_pretrained("quantumaikr/KoreanLM")
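Continuing the example above, a short generation call might look like this. The prompt and sampling parameters are illustrative only, not recommended settings.

prompt = "한국어 언어모델의 장점은"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a short continuation; max_new_tokens and sampling settings are examples.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))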
Technical Details
Training (Fine-Tuning)
Standard Fine-Tuning
torchrun --nproc_per_node=4 --master_port=1004 train.py \
--model_name_or_path quantumaikr/KoreanLM \
--data_path korean_data.json \
--num_train_epochs 3 \
--cache_dir './data' \
--bf16 True \
--tf32 True \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer'
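The --data_path argument above points to a JSON file of training examples. The exact schema expected by train.py is an assumption here; the sketch below produces a korean_data.json in an Alpaca-style instruction/input/output format, which this repository's Stanford Alpaca lineage suggests.

import json

# Illustrative Alpaca-style records; the schema is an assumption, not the
# project's documented format.
records = [
    {
        "instruction": "다음 문장을 영어로 번역하세요.",
        "input": "한국어 언어모델을 공부하고 있습니다.",
        "output": "I am studying Korean language models.",
    }
]

with open("korean_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)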
Fine-Tuning with DeepSpeed
pip install deepspeed
torchrun --nproc_per_node=4 --master_port=1004 train.py \
--deepspeed "./deepspeed.json" \
--model_name_or_path quantumaikr/KoreanLM \
--data_path korean_data.json \
--num_train_epochs 3 \
--cache_dir './data' \
--bf16 True \
--tf32 True \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03
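The command above reads its DeepSpeed settings from ./deepspeed.json. A minimal sketch of a ZeRO-3 configuration that defers batch-size and precision settings to the Trainer is shown below; the configuration actually shipped with the project may differ.

{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}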
Training with LoRA
python finetune-lora.py \
--base_model 'quantumaikr/KoreanLM' \
--data_path './korean_data.json' \
--output_dir './KoreanLM-LoRA' \
--cache_dir './data'
Inference
python generate.py \
--load_8bit \
--share_gradio \
--base_model 'quantumaikr/KoreanLM' \
--lora_weights 'quantumaikr/KoreanLM-LoRA' \
--cache_dir './data'
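If you prefer to run inference programmatically instead of through generate.py, a minimal sketch using the peft library is shown below. The adapter name mirrors the --lora_weights argument above; the prompt and generation settings are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("quantumaikr/KoreanLM", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "quantumaikr/KoreanLM-LoRA")
tokenizer = AutoTokenizer.from_pretrained("quantumaikr/KoreanLM")

inputs = tokenizer("한국어로 자기소개를 해주세요.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))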
Documentation
Pretrained Model Release and Web Demo
* Trained model: quantumaikr/KoreanLM
* The demo link will be released later.
How to Contribute
1. Raise an issue
Please raise issues or suggestions for improvement related to the KoreanLM project.
2. Write code
You can write code to add improvements or new features. Please submit your code via a Pull Request.
3. Write and translate documentation
Participate in writing or translating project documentation to improve the quality of the project.
4. Testing and feedback
Feedback on bugs or improvements found while using the project will be greatly appreciated.
License
The KoreanLM project is licensed under the Apache 2.0 License. Please follow the license requirements when using the project.
Technical Inquiry
If you have any questions regarding the KoreanLM project, please contact us via email or GitHub issues. We hope this project will contribute to the research and development of Korean language models and welcome your interest and participation.
Email: hi@quantumai.kr
This repository includes implementations inspired by the open_llama, Stanford Alpaca, and alpaca-lora projects.