
Llama-2-Ko 🦙🇰🇷
Llama-2-Ko is an advanced iteration of Llama 2, with an expanded vocabulary and further pretraining on a Korean corpus. Like its predecessor, it belongs to a family of generative text models ranging from 7 billion to 70 billion parameters. This repository focuses on the 7B pretrained version in the Hugging Face Transformers format. For the other models, refer to the index below.
🚀 Quick Start
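A minimal usage sketch with Hugging Face Transformers (not part of the original card): it assumes `torch`, `transformers>=4.34.0`, and `accelerate` (for `device_map="auto"`) are installed, plus enough memory for a 7B checkpoint; the prompt is an arbitrary example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/llama-2-ko-7b"

# use_fast=True: the card notes this model ships a HF fast tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # see the BF16 note below for Apple Silicon
    device_map="auto",          # requires the accelerate package
)

inputs = tokenizer("안녕하세요, 오늘은", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```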
✨ Features
Update Log
- 2023.12.27: A new model is available! It is trained only on an openly accessible Korean text corpus: https://huggingface.co/beomi/open-llama-2-ko-7b
- 2023.10.19: Fixed the tokenizer bug (space not applied when decoding) that appears with `transformers>=4.34.0` (see the check below).
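A quick round-trip check of the fix (a sketch, not from the original log; assumes `transformers>=4.34.0` is installed):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b", use_fast=True)
ids = tok("오늘은 날씨가 좋네요.")["input_ids"]
# With the fix, decoding keeps the spaces between words
print(tok.decode(ids, skip_special_tokens=True))  # expected: "오늘은 날씨가 좋네요."
```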
📚 Documentation
Model Details
- Model Developers: Junbum Lee (Beomi)
- Variations: Llama-2-Ko will come in parameter sizes of 7B, 13B, and 70B, with pretrained and fine-tuned variations.
- Input: The models take only text as input.
- Output: The models generate only text.
- Model Architecture: Llama-2-Ko is an auto-regressive language model using an optimized transformer architecture based on Llama-2.
Property | Details |
---|---|
Model Type | Auto-regressive language model with an optimized transformer architecture based on Llama-2 |
Training Data | A new mix of Korean online data |
Params | 7B |
Content Length | 4k |
GQA | ✗ |
Tokens | >40B* (plan to train up to 200B tokens) |
LR | 1e-5 |
Vocab Expansion
Model Name | Vocabulary Size | Description |
---|---|---|
Original Llama-2 | 32000 | SentencePiece BPE |
Expanded Llama-2-Ko | 46336 | SentencePiece BPE, with added Korean vocab and merges |
Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."

Model | Tokens |
---|---|
Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |
Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"

Model | Tokens |
---|---|
Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
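The splits above can be reproduced with a short sketch (assumed usage, not from the card; the expected outputs mirror the tables above):

```python
from transformers import AutoTokenizer

ko_tok = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b", use_fast=True)
print(len(ko_tok))  # expected: 46336, the expanded vocabulary size
print(ko_tok.tokenize("안녕하세요, 오늘은 날씨가 좋네요."))
# expected: ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']
print(ko_tok.tokenize("Llama 2: Open Foundation and Fine-Tuned Chat Models"))
# expected: the same split as original Llama-2, since English merges are unchanged
```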
Model Benchmark
LM Eval Harness - Korean (polyglot branch)
- Used EleutherAI's lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot (a minimal invocation sketch follows)
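A minimal invocation sketch (assumptions: the polyglot branch of the harness is installed, the pre-0.4 `lm_eval.evaluator.simple_evaluate` API is available, and `kobest_copa` is the COPA task id on that branch):

```python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # assumed adapter name in the pre-0.4 harness API
    model_args="pretrained=beomi/llama-2-ko-7b",
    tasks=["kobest_copa"],  # assumed task id on the polyglot branch
    num_fewshot=5,
)
print(results["results"])
```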
NSMC (Acc) - 50,000 full test set
TBD
COPA (F1)
Model | 0-shot | 5-shot | 10-shot | 50-shot |
---|---|---|---|---|
[skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 0.6696 | 0.6477 | 0.6419 | 0.6514 |
[kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 0.7345 | 0.7287 | 0.7277 | 0.7479 |
[facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 0.6723 | 0.6731 | 0.6769 | 0.7119 |
[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) | 0.7196 | 0.7193 | 0.7204 | 0.7206 |
[EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 0.7595 | 0.7608 | 0.7638 | 0.7788 |
[EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 0.7745 | 0.7676 | 0.7775 | 0.7887 |
[EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 0.7937 | 0.8108 | 0.8037 | 0.8369 |
Llama-2 Original 7B* | 0.562033 | 0.575982 | 0.576216 | 0.595532 |
Llama-2-Ko-7b 20B (10k) | 0.738780 | 0.762639 | 0.780761 | 0.797863 |
Llama-2-Ko-7b 40B (20k) | 0.743630 | 0.792716 | 0.803746 | 0.825944 |

*Llama-2 Original 7B used [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) (w/o tokenizer updated)
HellaSwag (F1)
Model | 0-shot | 5-shot | 10-shot | 50-shot |
---|---|---|---|---|
[skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 0.5243 | 0.5272 | 0.5166 | 0.5352 |
[kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 0.5590 | 0.5833 | 0.5828 | 0.5907 |
[facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 0.5665 | 0.5689 | 0.5565 | 0.5622 |
[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) | 0.5247 | 0.5260 | 0.5278 | 0.5427 |
[EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 0.5707 | 0.5830 | 0.5670 | 0.5787 |
[EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 0.5976 | 0.5998 | 0.5979 | 0.6208 |
[EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 0.5954 | 0.6306 | 0.6098 | 0.6118 |
Llama-2 Original 7B* | 0.415390 | 0.431382 | 0.421342 | 0.442003 |
Llama-2-Ko-7b 20B (10k) | 0.451757 | 0.466751 | 0.472607 | 0.482776 |
Llama-2-Ko-7b 40B (20k) | 0.456246 | 0.465665 | 0.469810 | 0.477374 |

*Llama-2 Original 7B used [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) (w/o tokenizer updated)
BoolQ (F1)
Model | 0-shot | 5-shot | 10-shot | 50-shot |
---|---|---|---|---|
[skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 0.3356 | 0.4014 | 0.3640 | 0.3560 |
[kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 0.4514 | 0.5981 | 0.5499 | 0.5202 |
[facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 0.4464 | 0.3324 | 0.3324 | 0.3324 |
[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) | 0.3552 | 0.4751 | 0.4109 | 0.4038 |
[EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 0.4320 | 0.5263 | 0.4930 | 0.4038 |
[EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 0.4356 | 0.5698 | 0.5187 | 0.5236 |
[EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 0.4818 | 0.6041 | 0.6289 | 0.6448 |
Llama-2 Original 7B* | 0.352050 | 0.563238 | 0.474788 | 0.419222 |
Llama-2-Ko-7b 20B (10k) | 0.360656 | 0.679743 | 0.680109 | 0.662152 |
Llama-2-Ko-7b 40B (20k) | 0.578640 | 0.697747 | 0.708358 | 0.714423 |

*Llama-2 Original 7B used [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) (w/o tokenizer updated)
SentiNeg (F1)
Model | 0-shot | 5-shot | 10-shot | 50-shot |
---|---|---|---|---|
[skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 0.6065 | 0.6878 | 0.7280 | 0.8413 |
[kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 0.3747 | 0.8942 | 0.9294 | 0.9698 |
[facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 0.3578 | 0.4471 | 0.3964 | 0.5271 |
[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) | 0.6790 | 0.6257 | 0.5514 | 0.7851 |
[EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 0.4858 | 0.7950 | 0.7320 | 0.7851 |
[EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 0.3394 | 0.8841 | 0.8808 | 0.9521 |
[EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 0.9117 | 0.9015 | 0.9345 | 0.9723 |
Llama-2 Original 7B* | 0.347502 | 0.529124 | 0.480641 | 0.788457 |
Llama-2-Ko-7b 20B (10k) | 0.485546 | 0.829503 | 0.871141 | 0.851253 |
Llama-2-Ko-7b 40B (20k) | 0.459447 | 0.761079 | 0.727611 | 0.936988 |

*Llama-2 Original 7B used [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) (w/o tokenizer updated)
Note for oobabooga/text-generation-webui

Remove the `ValueError` at the `load_tokenizer` function (line 109 or near) in `modules/models.py`:
```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```
Since Llama-2-Ko uses the FastTokenizer provided by the HF tokenizers library, NOT the sentencepiece package, you must pass the `use_fast=True` option when initializing the tokenizer.
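For example, a minimal initialization sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b", use_fast=True)
assert tokenizer.is_fast  # backed by HF tokenizers, not sentencepiece
```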
Apple Silicon does not support BF16 computing; use CPU instead. (BF16 is supported when using an NVIDIA GPU.)
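One way to handle this is to pick the dtype from the hardware at runtime (a hedged sketch):

```python
import torch

# Prefer BF16 only where the hardware supports it (e.g. recent NVIDIA GPUs);
# on Apple Silicon fall back to FP32 on CPU, per the note above.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float32
```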
Citation
```
@misc{l._junbum_2023,
  author    = { {L. Junbum} },
  title     = { llama-2-ko-7b (Revision 4a9993e) },
  year      = 2023,
  url       = { https://huggingface.co/beomi/llama-2-ko-7b },
  doi       = { 10.57967/hf/1098 },
  publisher = { Hugging Face }
}
```
Acknowledgement
The training is supported by the TPU Research Cloud program.
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 39.43 |
ARC (25-shot) | 48.46 |
HellaSwag (10-shot) | 75.28 |
MMLU (5-shot) | 39.56 |
TruthfulQA (0-shot) | 34.49 |
Winogrande (5-shot) | 72.14 |
GSM8K (5-shot) | 1.97 |
DROP (3-shot) | 4.1 |