
Llama-2-Ko 🦙🇰🇷
Llama-2-Ko is an advanced iteration of Llama 2, with an expanded vocabulary and further pretraining on a Korean corpus. Like its predecessor, it belongs to a family of generative text models ranging from 7 billion to 70 billion parameters. This repository focuses on the 7B pretrained version in the Hugging Face Transformers format. For the other models, refer to the index below.
🚀 Quick Start
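A minimal usage sketch with Hugging Face Transformers (not part of the original card): it assumes `torch`, `transformers>=4.34.0`, and `accelerate` (for `device_map="auto"`) are installed, plus enough memory for a 7B checkpoint; the prompt is an arbitrary example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/llama-2-ko-7b"

# use_fast=True: the card notes this model ships a HF fast tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # see the BF16 note below for Apple Silicon
    device_map="auto",          # requires the accelerate package
)

inputs = tokenizer("안녕하세요, 오늘은", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```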
✨ Features
Update Log
- 2023.12.27: A new model is available! It is trained only on an openly accessible Korean text corpus: https://huggingface.co/beomi/open-llama-2-ko-7b
- 2023.10.19: Fixed the tokenizer bug (space not applied when decoding) that appears with `transformers>=4.34.0` (see the check below).
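A quick round-trip check of the fix (a sketch, not from the original log; assumes `transformers>=4.34.0` is installed):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b", use_fast=True)
ids = tok("오늘은 날씨가 좋네요.")["input_ids"]
# With the fix, decoding keeps the spaces between words
print(tok.decode(ids, skip_special_tokens=True))  # expected: "오늘은 날씨가 좋네요."
```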
📚 Documentation
Model Details
- Model Developers: Junbum Lee (Beomi)
- Variations: Llama-2-Ko will come in parameter sizes of 7B, 13B, and 70B, with pretrained and fine-tuned variations.
- Input: The models take only text as input.
- Output: The models generate only text.
- Model Architecture: Llama-2-Ko is an auto-regressive language model using an optimized transformer architecture based on Llama-2.
Property | Details |
---|---|
Model Type | Auto-regressive language model with an optimized transformer architecture based on Llama-2 |
Training Data | A new mix of Korean online data |
Params | 7B |
Content Length | 4k |
GQA | ✗ |
Tokens | >40B* (plan to train up to 200B tokens) |
LR | 1e-5 |
Vocab Expansion
Model Name | Vocabulary Size | Description |
---|---|---|
Original Llama-2 | 32000 | SentencePiece BPE |
Expanded Llama-2-Ko | 46336 | SentencePiece BPE, with added Korean vocab and merges |
Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."

Model | Tokens |
---|---|
Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |
Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"

Model | Tokens |
---|---|
Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
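The splits above can be reproduced with a short sketch (assumed usage, not from the card; the expected outputs mirror the tables above):

```python
from transformers import AutoTokenizer

ko_tok = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b", use_fast=True)
print(len(ko_tok))  # expected: 46336, the expanded vocabulary size
print(ko_tok.tokenize("안녕하세요, 오늘은 날씨가 좋네요."))
# expected: ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']
print(ko_tok.tokenize("Llama 2: Open Foundation and Fine-Tuned Chat Models"))
# expected: the same split as original Llama-2, since English merges are unchanged
```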
Model Benchmark
LM Eval Harness - Korean (polyglot branch)
- Used EleutherAI's lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot (a minimal invocation sketch follows)
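A minimal invocation sketch (assumptions: the polyglot branch of the harness is installed, the pre-0.4 `lm_eval.evaluator.simple_evaluate` API is available, and `kobest_copa` is the COPA task id on that branch):

```python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # assumed adapter name in the pre-0.4 harness API
    model_args="pretrained=beomi/llama-2-ko-7b",
    tasks=["kobest_copa"],  # assumed task id on the polyglot branch
    num_fewshot=5,
)
print(results["results"])
```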
NSMC (Acc) - 50,000 full test set
TBD
COPA (F1)
Model | 0-shot | 5-shot | 10-shot | 50-shot |
---|---|---|---|---|
[skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 0.6696 | 0.6477 | 0.6419 | 0.6514 |
[kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 0.7345 | 0.7287 | 0.7277 | 0.7479 |
[facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 0.6723 | 0.6731 | 0.6769 | 0.7119 |
[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) | 0.7196 | 0.7193 | 0.7204 | 0.7206 |
[EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 0.7595 | 0.7608 | 0.7638 | 0.7788 |
[EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 0.7745 | 0.7676 | 0.7775 | 0.7887 |
[EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 0.7937 | 0.8108 | 0.8037 | 0.8369 |
Llama-2 Original 7B* | 0.562033 | 0.575982 | 0.576216 | 0.595532 |
Llama-2-Ko-7b 20B (10k) | 0.738780 | 0.762639 | 0.780761 | 0.797863 |
Llama-2-Ko-7b 40B (20k) | 0.743630 | 0.792716 | 0.803746 | 0.825944 |

*Llama-2 Original 7B used [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) (w/o tokenizer updated)
HellaSwag (F1)
Model | 0-shot | 5-shot | 10-shot | 50-shot |
---|---|---|---|---|
[skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 0.5243 | 0.5272 | 0.5166 | 0.5352 |
[kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 0.5590 | 0.5833 | 0.5828 | 0.5907 |
[facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 0.5665 | 0.5689 | 0.5565 | 0.5622 |
[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) | 0.5247 | 0.5260 | 0.5278 | 0.5427 |
[EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 0.5707 | 0.5830 | 0.5670 | 0.5787 |
[EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 0.5976 | 0.5998 | 0.5979 | 0.6208 |
[EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 0.5954 | 0.6306 | 0.6098 | 0.6118 |
Llama-2 Original 7B* | 0.415390 | 0.431382 | 0.421342 | 0.442003 |
Llama-2-Ko-7b 20B (10k) | 0.451757 | 0.466751 | 0.472607 | 0.482776 |
Llama-2-Ko-7b 40B (20k) | 0.456246 | 0.465665 | 0.469810 | 0.477374 |

*Llama-2 Original 7B used [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) (w/o tokenizer updated)
BoolQ (F1)
Model | 0-shot | 5-shot | 10-shot | 50-shot |
---|---|---|---|---|
[skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 0.3356 | 0.4014 | 0.3640 | 0.3560 |
[kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 0.4514 | 0.5981 | 0.5499 | 0.5202 |
[facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 0.4464 | 0.3324 | 0.3324 | 0.3324 |
[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) | 0.3552 | 0.4751 | 0.4109 | 0.4038 |
[EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 0.4320 | 0.5263 | 0.4930 | 0.4038 |
[EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 0.4356 | 0.5698 | 0.5187 | 0.5236 |
[EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 0.4818 | 0.6041 | 0.6289 | 0.6448 |
Llama-2 Original 7B* | 0.352050 | 0.563238 | 0.474788 | 0.419222 |
Llama-2-Ko-7b 20B (10k) | 0.360656 | 0.679743 | 0.680109 | 0.662152 |
Llama-2-Ko-7b 40B (20k) | 0.578640 | 0.697747 | 0.708358 | 0.714423 |

*Llama-2 Original 7B used [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) (w/o tokenizer updated)
SentiNeg (F1)
Model | 0-shot | 5-shot | 10-shot | 50-shot |
---|---|---|---|---|
[skt/ko-gpt-trinity-1.2B-v0.5](https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5) | 0.6065 | 0.6878 | 0.7280 | 0.8413 |
[kakaobrain/kogpt](https://huggingface.co/kakaobrain/kogpt) | 0.3747 | 0.8942 | 0.9294 | 0.9698 |
[facebook/xglm-7.5B](https://huggingface.co/facebook/xglm-7.5B) | 0.3578 | 0.4471 | 0.3964 | 0.5271 |
[EleutherAI/polyglot-ko-1.3b](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) | 0.6790 | 0.6257 | 0.5514 | 0.7851 |
[EleutherAI/polyglot-ko-3.8b](https://huggingface.co/EleutherAI/polyglot-ko-3.8b) | 0.4858 | 0.7950 | 0.7320 | 0.7851 |
[EleutherAI/polyglot-ko-5.8b](https://huggingface.co/EleutherAI/polyglot-ko-5.8b) | 0.3394 | 0.8841 | 0.8808 | 0.9521 |
[EleutherAI/polyglot-ko-12.8b](https://huggingface.co/EleutherAI/polyglot-ko-12.8b) | 0.9117 | 0.9015 | 0.9345 | 0.9723 |
Llama-2 Original 7B* | 0.347502 | 0.529124 | 0.480641 | 0.788457 |
Llama-2-Ko-7b 20B (10k) | 0.485546 | 0.829503 | 0.871141 | 0.851253 |
Llama-2-Ko-7b 40B (20k) | 0.459447 | 0.761079 | 0.727611 | 0.936988 |

*Llama-2 Original 7B used [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) (w/o tokenizer updated)
Note for oobabooga/text-generation-webui

Remove the `ValueError` at the `load_tokenizer` function (line 109 or near) in `modules/models.py`:
```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```
Since Llama-2-Ko uses the FastTokenizer provided by the HF tokenizers library, NOT the sentencepiece package, you must pass the `use_fast=True` option when initializing the tokenizer.
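For example, a minimal initialization sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b", use_fast=True)
assert tokenizer.is_fast  # backed by HF tokenizers, not sentencepiece
```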
Apple Silicon does not support BF16 computing; use CPU instead. (BF16 is supported when using an NVIDIA GPU.)
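One way to handle this is to pick the dtype from the hardware at runtime (a hedged sketch):

```python
import torch

# Prefer BF16 only where the hardware supports it (e.g. recent NVIDIA GPUs);
# on Apple Silicon fall back to FP32 on CPU, per the note above.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float32
```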
Citation
```
@misc{l._junbum_2023,
  author    = { {L. Junbum} },
  title     = { llama-2-ko-7b (Revision 4a9993e) },
  year      = 2023,
  url       = { https://huggingface.co/beomi/llama-2-ko-7b },
  doi       = { 10.57967/hf/1098 },
  publisher = { Hugging Face }
}
```
Acknowledgement
The training is supported by the TPU Research Cloud program.
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 39.43 |
ARC (25-shot) | 48.46 |
HellaSwag (10-shot) | 75.28 |
MMLU (5-shot) | 39.56 |
TruthfulQA (0-shot) | 34.49 |
Winogrande (5-shot) | 72.14 |
GSM8K (5-shot) | 1.97 |
DROP (3-shot) | 4.1 |