Open-Solar-Ko
Solar-Ko is an advanced version of the upstage/SOLAR-10.7B-v1.0 model. It features an expanded vocabulary and incorporates a Korean corpus for additional pretraining. Open-Solar-Ko uses only publicly accessible Korean corpora, such as AI Hub, the Modu Corpus (모두의 말뭉치), and Korean Wikipedia. Because it was trained solely on publicly available corpora, the model can be used freely by everyone under the Apache 2.0 open-source license.
Quick Start
This section is not provided in the original README, so it is skipped.
Features
- Advanced Iteration: Solar-Ko is an advanced version of the upstage/SOLAR-10.7B-v1.0 model.
- Vocab Expansion: An expanded vocabulary improves tokenizing efficiency for Korean text.
- Open-Source and Public Data: Trained on publicly available Korean corpora, allowing unrestricted use under the Apache 2.0 license.
Installation
This section is not provided in the original README, so it is skipped.
Usage Examples
The original README does not include a usage example.
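The following is a minimal sketch of loading the model for text generation, assuming it is served under the model ID beomi/SOLAR-KO-10.7B (as given in the citation below) through the standard Hugging Face `transformers` causal-LM API; it is an illustration rather than an official example.

```python
# Minimal sketch: load SOLAR-KO-10.7B and generate a short completion.
# Assumes recent `transformers`, `accelerate`, and enough GPU memory for a
# 10.7B-parameter model in half precision (adjust dtype/device_map as needed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/SOLAR-KO-10.7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit the weights
    device_map="auto",          # spread across available devices
)

prompt = "대한민국의 수도는"  # "The capital of South Korea is ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Note that this is a base (continually pretrained) model rather than an instruction-tuned one, so prompts work best as plain text to be completed.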
Documentation
Update Log
- 2024.01.08: Initial test version release of Solar-Ko
Model Details
- Model Developers: Junbum Lee (Beomi)
- Variations: Solar-Ko is available in a single parameter size: a 10.7B continually pretrained version.
- Input: The model accepts only text input.
- Output: The model produces text output exclusively.
- Model Architecture: SOLAR-KO-10.7B is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
| Property | Details |
|---|---|
| Model Type | Auto-regressive language model based on an optimized transformer architecture derived from Llama-2 |
| Training Data | A curated mix of publicly accessible Korean corpora |
| Parameters | 10.7B |
| Context Length | 4k tokens |
| GQA | Yes |
| Tokens | >15B* |
| Learning Rate | 5e-5 |
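The table above can be cross-checked against the published model configuration. The sketch below is an illustration that assumes the model exposes the Llama-style config fields used by `transformers` (`vocab_size`, `max_position_embeddings`, `num_key_value_heads`); it is not part of the original model card.

```python
# Sketch: inspect the published config to confirm the Model Details table.
# Field names assume a Llama-style configuration (SOLAR derives from Llama-2).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("beomi/SOLAR-KO-10.7B")

print("vocab size:    ", config.vocab_size)               # 46592 after expansion
print("context length:", config.max_position_embeddings)  # 4k per the table
# Grouped-query attention (GQA) shows up as fewer KV heads than attention heads.
print("GQA:           ", config.num_key_value_heads < config.num_attention_heads)
```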
Training Corpus
The model was trained using selected datasets from AI Hub and Modu Corpus. Detailed information about the training datasets is as follows:
- AI Hub: corpus/AI_HUB
  - Only the Training segment of the data was used.
  - The Validation and Test segments were deliberately excluded.
- Modu Corpus: corpus/MODU_CORPUS
The final JSONL dataset used to train this model is approximately 61GB in size. The total token count is approximately 15 billion tokens (*counted with the expanded tokenizer; with the original SOLAR tokenizer, the same corpus amounts to >60 billion tokens).
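The token count can be recomputed from the JSONL corpus with the expanded tokenizer. The sketch below is illustrative only: the file name `corpus.jsonl` and the `"text"` field are placeholders, since the preprocessed corpus itself is not distributed.

```python
# Rough sketch: count tokens in a JSONL corpus with the expanded tokenizer.
# "corpus.jsonl" and the "text" field are placeholders for illustration.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/SOLAR-KO-10.7B")

total_tokens = 0
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        total_tokens += len(tokenizer(doc["text"])["input_ids"])

print(f"total tokens: {total_tokens:,}")  # the model card reports roughly 15B
```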
Vocab Expansion
| Model Name | Vocabulary Size | Description |
|---|---|---|
| Original Solar | 32000 | SentencePiece BPE |
| Expanded SOLAR-KO-10.7B | 46592 | SentencePiece BPE; added Korean vocab and merges |
Tokenizing "안녕하세요, 오늘은 날씨가 좋네요." ("Hello, the weather is nice today.")
- SOLAR-10.7B: 26 tokens
- SOLAR-KO-10.7B: 8 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.'] |
| SOLAR-KO-10.7B | ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.'] |
Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"
- SOLAR-10.7B: 22 tokens
- SOLAR-KO-10.7B: 22 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
| SOLAR-KO-10.7B | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
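The token counts in the two examples above can be reproduced with the published tokenizers. The sketch below assumes both tokenizers load through `transformers`' `AutoTokenizer`; it is illustrative, not taken from the original README.

```python
# Sketch: compare the original SOLAR tokenizer and the expanded SOLAR-KO
# tokenizer on the two example sentences above.
from transformers import AutoTokenizer

original = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
expanded = AutoTokenizer.from_pretrained("beomi/SOLAR-KO-10.7B")

sentences = [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]

for sentence in sentences:
    for name, tok in [("SOLAR-10.7B", original), ("SOLAR-KO-10.7B", expanded)]:
        tokens = tok.tokenize(sentence)  # subword pieces, without special tokens
        print(f"{name}: {len(tokens)} tokens -> {tokens}")
```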
Model Benchmark
LM Eval Harness - Korean (polyglot branch)
- Used EleutherAI's lm-evaluation-harness (polyglot branch): https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot. Scores are reported by the number of few-shot examples (0, 5, 10, 50); a reproduction sketch follows the table.
| Task (metric) | 0-shot | 5-shot | 10-shot | 50-shot |
|---|---|---|---|---|
| kobest_boolq (macro_f1) | 0.853949 | 0.88098 | 0.898139 | 0.902354 |
| kobest_copa (macro_f1) | 0.804531 | 0.826736 | 0.837656 | 0.860899 |
| kobest_hellaswag (macro_f1) | 0.507174 | 0.500983 | 0.487287 | 0.512182 |
| kobest_sentineg (macro_f1) | 0.3517 | 0.972291 | 0.977321 | 0.984884 |
| kohatespeech (macro_f1) | 0.258111 | 0.403957 | 0.386808 | 0.462393 |
| kohatespeech_apeach (macro_f1) | 0.337667 | 0.651697 | 0.705337 | 0.827757 |
| kohatespeech_gen_bias (macro_f1) | 0.124535 | 0.503464 | 0.498501 | 0.443218 |
| korunsmile (f1) | 0.3814 | 0.356939 | 0.369989 | 0.296193 |
| nsmc (acc) | 0.5356 | 0.87162 | 0.88654 | 0.89632 |
| pawsx_ko (acc) | 0.5435 | 0.5245 | 0.5315 | 0.5385 |
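The sketch below outlines how such scores might be reproduced with the polyglot branch of the harness. The `hf-causal` adapter, the `evaluator.simple_evaluate` entry point, and the task names follow the 0.3-era lm-evaluation-harness API that the branch is based on; exact argument names and available tasks may differ on that branch, so treat this as an unverified outline rather than the authors' command.

```python
# Outline (unverified): score SOLAR-KO-10.7B on Korean tasks using the
# polyglot branch of EleutherAI's lm-evaluation-harness.
# https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                              # Hugging Face causal-LM adapter
    model_args="pretrained=beomi/SOLAR-KO-10.7B",
    tasks=["kobest_boolq", "kobest_copa", "nsmc"],  # subset of the tasks above
    num_fewshot=5,                                  # corresponds to the 5-shot column
)

for task, metrics in results["results"].items():
    print(task, metrics)
```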
Citation
@misc{solar_ko_junbum_2023,
  author = {{L. Junbum}},
  title = {Solar-Ko-10.7b},
  year = {2024},
  url = {https://huggingface.co/beomi/SOLAR-KO-10.7B},
  publisher = {Hugging Face}
}
Acknowledgements
License
Apache 2.0