🚀 🦙 KoLlama2-7b Repository 🦙
KoLlama2 (Korean Large Language Model Meta AI 2) is an open-source project aimed at enhancing the Korean performance of Llama 2, an English-based LLM.
🚀 Quick Start
This section provides a high-level overview of the KoLlama2 project. For more detailed information, please refer to the corresponding sections below.
✨ Features
- Addressing Language Imbalance: KoLlama2 aims to solve the problem of the low proportion of Korean in the pre-training data of large-scale language models, enabling Korean users to better experience the capabilities of LLMs.
- Multiple Approaches: It explores various methods, such as different fine-tuning techniques and diverse datasets, to improve Korean proficiency.
📚 Documentation
Problem
From BERT and GPT-3 to Llama 2, the remarkable progress of large-scale language models has attracted widespread attention. However, because LLMs are pre-trained on large corpora, most of the training data is in English, and Korean accounts for only a very small percentage.
- **Percentage of Korean in GPT-3's pre-training data**: 0.01697%
  (Source: https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_word_count.csv)
- **Percentage of Korean in the Llama 2 model's pre-training data**: 0.06%
  (Source: Table 10, p. 22, Llama 2: Open Foundation and Fine-Tuned Chat Models, Hugo Touvron et al., July 18, 2023)
These percentages are significantly lower than the share of Korean speakers (81.7 million) in the world's population (7.888 billion), which is about 1.035%. This is due to multiple factors, such as the isolated nature of the Korean language and the lack of a well-prepared Korean corpus, and it severely limits Korean users from fully experiencing the rich capabilities of LLMs.
Proposed Solutions
Korean-based LLM Pretraining
One of the best solutions is to build one's own language model, pre-trained on Korean data. This approach is being led by large, well-funded companies.
- Naver HyperCLOVA X: https://clova.ai/hyperclova
- Kakao KoGPT: https://github.com/kakaobrain/kogpt
- EleutherAI polyglot-ko: https://github.com/EleutherAI/polyglot
Although this approach can effectively address the lack of Korean proficiency in LLMs, the field is evolving so rapidly that it is difficult to predict future trends and to train a new large-scale language model for every change. We therefore also need a lighter and faster method that can be used in parallel with training our own language models.
Fine-tuning a Foreign-language-based LLM
Fine-tuning a foreign-language-based LLM into Korean is a good solution. The following attempts have been made based on the LLaMA model:
- KoAlpaca: https://github.com/Beomi/KoAlpaca
- KULLM: https://github.com/nlpai-lab/KULLM
- KoVicuna: https://github.com/melodysdreamj/KoVicuna
- KORani: https://github.com/krafton-ai/KORani
These attempts have increased interest in open-source LLMs and helped us understand different fine-tuning methods. However, their limitations are also obvious:
- For the LLaMA model, since Korean was excluded from the pre-training data, no method (including full fine-tuning, LoRA, and QLoRA) could achieve satisfactory Korean performance.
- There was no unified evaluation method for Korean language learning, making it difficult to determine the most effective learning method.
- Each project was developed independently by separate groups, resulting in duplicated effort.
KoLlama2 Project Proposal
Based on the experience with the LLaMA model, KoLlama2 aims to find the best way to fine-tune a foreign-language-based LLM into Korean. The following attempts are needed (a minimal fine-tuning sketch follows this list):
- Try different methodologies such as QLoRA, LoRA, and full fine-tuning to evaluate how much the 0.06% of Korean included in Llama 2's pre-training data can be improved.
- Apply various datasets such as Alpaca and Vicuna to identify the most effective dataset type for improving Korean proficiency.
- Experiment with new techniques, including curriculum learning that gradually increases difficulty starting from simple English-to-Korean translation, additional pre-training on a large Korean corpus, and the vocabulary expansion used in Chinese-LLaMA.
- Devise a reasonable evaluation method to assess each methodology.
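As a concrete illustration of the QLoRA direction above, the sketch below loads a Llama 2 checkpoint in 4-bit and attaches LoRA adapters using Hugging Face `transformers` and `peft`. The model ID, target modules, and hyperparameters are assumptions chosen for illustration, not the project's finalized configuration.

```python
# Minimal QLoRA sketch (assumes transformers, peft, and bitsandbytes are installed
# and that access to meta-llama/Llama-2-7b-hf has been granted on Hugging Face).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any Llama 2 checkpoint works

# Load the base model in 4-bit NF4 so the 7B model fits on a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Freeze the quantized weights and train only small LoRA adapter matrices.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # illustrative choice of attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, a standard transformers Trainer (or trl's SFTTrainer) can be run
# on a Korean instruction dataset such as a KoAlpaca-style corpus.
```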
Benchmarks
This section is yet to be filled with specific benchmarking information.
References
- [Research Paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)
- [Llama 2 technical overview](https://ai.meta.com/resources/models-and-libraries/llama)
- [Open Innovation AI Research Community](https://ai.meta.com/llama/open-innovation-ai-research-community/)
📦 Installation
Download
⚠️ Important Note
7/18: We're aware that some users are encountering download issues. If you still face problems, please remove all local files, re-clone the repository, and [request a new download link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). It's crucial to do this in case you have corrupt local files. When you receive the email, copy only the link text (it should start with https://download.llamameta.net, not https://l.facebook.com, as the latter will cause errors).
To download the model weights and tokenizer, visit the [Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and accept the License. Once your request is approved, you will receive a signed URL via email. Then run the `download.sh` script and enter the provided URL when prompted to start the download. Make sure to copy the URL text directly; do not use the 'Copy link address' option when right-clicking the URL. If the copied URL starts with https://download.llamameta.net, you copied it correctly. If it starts with https://l.facebook.com, the copy is incorrect.

Pre-requisites: ensure that `wget` and `md5sum` are installed. Then run the script: `./download.sh`.

Note that the links expire after 24 hours and a certain number of downloads. If you encounter errors such as `403: Forbidden`, you can request a new link.
Access on Hugging Face
You can also download from [Hugging Face](https://huggingface.co/meta-llama). First, request a download from the Meta AI website using the same email address as your Hugging Face account. After that, you can request access to any of the models on Hugging Face, and your account will be granted access to all versions within 1-2 days.
Setup
In a conda environment with PyTorch / CUDA available, clone the repository and run the following command in the top-level directory:

```
pip install -e .
```
💻 Usage Examples
Inference
Different models require different model-parallel (MP) values:

| Model | MP |
|-------|----|
| 7B    | 1  |
| 13B   | 2  |
| 70B   | 8  |
All models support a sequence length of up to 4096 tokens, but the cache is pre-allocated according to the `max_seq_len` and `max_batch_size` values, so set these values according to your hardware.
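For reference, a minimal sketch of how these two values are typically passed when building a generator with this repository's `llama` package is shown below. Paths and sizes are placeholders, and the script must be launched with `torchrun`, as in the examples that follow.

```python
# Minimal sketch, assuming `pip install -e .` has been run in this repository
# and the script is launched via torchrun (which sets up the distributed env).
from llama import Llama

generator = Llama.build(
    ckpt_dir="llama-2-7b/",            # placeholder checkpoint directory
    tokenizer_path="tokenizer.model",  # placeholder tokenizer path
    max_seq_len=128,                   # cache is allocated for this many tokens per sequence
    max_batch_size=4,                  # and for this many sequences at once
)
```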
Pretrained Models
These models are not fine-tuned for chat or Q&A. They should be prompted so that the expected answer is the natural continuation of the prompt.
See `example_text_completion.py` for examples. To run it with the llama-2-7b model (`nproc_per_node` needs to be set to the `MP` value):
```
torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4
```
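Inside such a script, prompting for natural continuation comes down to a call like the sketch below, which continues the `generator` example above. The prompt text and sampling values are illustrative, not prescribed by the repository.

```python
# Pretrained (non-chat) models: phrase the prompt so the answer is its continuation.
prompts = ["Simply put, the theory of relativity states that "]
results = generator.text_completion(
    prompts,
    max_gen_len=64,    # illustrative generation length
    temperature=0.6,   # illustrative sampling settings
    top_p=0.9,
)
for prompt, result in zip(prompts, results):
    print(prompt + result["generation"])
```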
Fine-tuned Chat Models
The fine-tuned models are trained for dialogue applications. To achieve their expected features and performance, a specific formatting defined in `chat_completion` must be followed, including the `INST` and `<<SYS>>` tags, `BOS` and `EOS` tokens, and the appropriate whitespace and line breaks (we recommend calling `strip()` on inputs to avoid double spaces).
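For orientation, the widely documented Llama 2 chat template composes those tags roughly as in the sketch below. This is only an illustration built from placeholder messages; the `chat_completion` function in this repository remains the authoritative implementation.

```python
# Illustrative only: how the INST and <<SYS>> tags wrap a single-turn dialogue.
# The BOS (<s>) and EOS (</s>) tokens are added by the tokenizer/generation code.
system_prompt = "You are a helpful assistant."      # hypothetical system message
user_message = "Explain KoLlama2 in one sentence."  # hypothetical user turn

prompt = (
    "[INST] <<SYS>>\n"
    f"{system_prompt}\n"
    "<</SYS>>\n\n"
    f"{user_message} [/INST]"
)
print(prompt)
```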
You can also deploy additional classifiers to filter out unsafe inputs and outputs. See the llama-recipes repo for [an example](https://github.com/facebookresearch/llama-recipes/blob/main/inference/inference.py) of adding a safety checker to the inputs and outputs of your inference code.

Example of using llama-2-7b-chat:
```
torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 4
```
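At the Python level, `example_chat_completion.py` passes dialogues as lists of role/content dictionaries and lets `chat_completion` apply the formatting described above. A trimmed sketch follows, again reusing the `generator` from the earlier build example; the dialogue content and sampling values are illustrative.

```python
# Each dialog is a list of turns; chat_completion applies the INST/<<SYS>> formatting.
dialogs = [
    [{"role": "user", "content": "What is the recipe for mayonnaise?"}],
    [
        {"role": "system", "content": "Always answer in Korean."},  # hypothetical system turn
        {"role": "user", "content": "Introduce the KoLlama2 project."},
    ],
]
results = generator.chat_completion(
    dialogs,
    max_gen_len=256,   # illustrative
    temperature=0.6,
    top_p=0.9,
)
for result in results:
    print(result["generation"]["role"], ":", result["generation"]["content"])
```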
🔧 Technical Details
Llama 2 is a new technology that carries potential risks with use. Testing conducted to date has not, and could not, cover all scenarios. To help developers address these risks, we have created the [Responsible Use Guide](Responsible-Use-Guide.pdf). More details can be found in our research paper.
📄 License
Our model and weights are licensed for both researchers and commercial entities, adhering to the principles of openness. Our mission is to empower individuals and the industry through this opportunity while promoting an environment of discovery and ethical AI advancements. See the LICENSE file, as well as our accompanying Acceptable Use Policy.
Original LLaMA
The repository for the original LLaMA release is in the `llama_v1` branch.
Issues
Please report any software “bug,” or other problems with the models through one of the following means:
Model Card
See MODEL_CARD.md.