Open-Solar-Ko
Solar-Ko is an advanced version of the upstage/SOLAR-10.7B-v1.0 model. It features an expanded vocabulary and incorporates a Korean corpus for additional pretraining. Open-Solar-Ko uses only publicly accessible Korean corpora, such as AI Hub, the Modu Corpus (모두의 말뭉치), and Korean Wikipedia. Because it was trained solely on publicly available corpora, the model can be used freely by everyone under the Apache 2.0 open-source license.
Quick Start
This section is not provided in the original README, so it is skipped.
Features
- Advanced Iteration: Solar-Ko is an advanced version of the upstage/SOLAR-10.7B-v1.0 model.
- Vocab Expansion: An expanded vocabulary improves tokenizing efficiency for Korean text.
- Open-Source and Public Data: Trained on publicly available Korean corpora, allowing unrestricted use under the Apache 2.0 license.
Installation
This section is not provided in the original README, so it is skipped.
Usage Examples
The original README does not include a usage example.
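The following is a minimal sketch of loading the model for text generation, assuming it is served under the model ID beomi/SOLAR-KO-10.7B (as given in the citation below) through the standard Hugging Face `transformers` causal-LM API; it is an illustration rather than an official example.

```python
# Minimal sketch: load SOLAR-KO-10.7B and generate a short completion.
# Assumes recent `transformers`, `accelerate`, and enough GPU memory for a
# 10.7B-parameter model in half precision (adjust dtype/device_map as needed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/SOLAR-KO-10.7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit the weights
    device_map="auto",          # spread across available devices
)

prompt = "대한민국의 수도는"  # "The capital of South Korea is ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Note that this is a base (continually pretrained) model rather than an instruction-tuned one, so prompts work best as plain text to be completed.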
Documentation
Update Log
- 2024.01.08: Initial test version release of Solar-Ko
Model Details
- Model Developers: Junbum Lee (Beomi)
- Variations: Solar-Ko is available in a single parameter size: a 10.7B continually pretrained version.
- Input: The model accepts only text input.
- Output: The model produces text output exclusively.
- Model Architecture: SOLAR-KO-10.7B is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
| Property | Details |
|---|---|
| Model Type | Auto-regressive language model based on an optimized transformer architecture derived from Llama-2 |
| Training Data | A curated mix of publicly accessible Korean corpora |
| Parameters | 10.7B |
| Context Length | 4k tokens |
| GQA | Yes |
| Tokens | >15B* |
| Learning Rate | 5e-5 |
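The table above can be cross-checked against the published model configuration. The sketch below is an illustration that assumes the model exposes the Llama-style config fields used by `transformers` (`vocab_size`, `max_position_embeddings`, `num_key_value_heads`); it is not part of the original model card.

```python
# Sketch: inspect the published config to confirm the Model Details table.
# Field names assume a Llama-style configuration (SOLAR derives from Llama-2).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("beomi/SOLAR-KO-10.7B")

print("vocab size:    ", config.vocab_size)               # 46592 after expansion
print("context length:", config.max_position_embeddings)  # 4k per the table
# Grouped-query attention (GQA) shows up as fewer KV heads than attention heads.
print("GQA:           ", config.num_key_value_heads < config.num_attention_heads)
```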
Training Corpus
The model was trained using selected datasets from AI Hub and Modu Corpus. Detailed information about the training datasets is as follows:
- AI Hub: corpus/AI_HUB
  - Only the Training segment of the data was used.
  - The Validation and Test segments were deliberately excluded.
- Modu Corpus: corpus/MODU_CORPUS
The final JSONL dataset used to train this model is approximately 61GB in size. The total token count is approximately 15 billion tokens (*counted with the expanded tokenizer; with the original SOLAR tokenizer, the same corpus amounts to >60 billion tokens).
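The token count can be recomputed from the JSONL corpus with the expanded tokenizer. The sketch below is illustrative only: the file name `corpus.jsonl` and the `"text"` field are placeholders, since the preprocessed corpus itself is not distributed.

```python
# Rough sketch: count tokens in a JSONL corpus with the expanded tokenizer.
# "corpus.jsonl" and the "text" field are placeholders for illustration.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/SOLAR-KO-10.7B")

total_tokens = 0
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        total_tokens += len(tokenizer(doc["text"])["input_ids"])

print(f"total tokens: {total_tokens:,}")  # the model card reports roughly 15B
```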
Vocab Expansion
| Model Name | Vocabulary Size | Description |
|---|---|---|
| Original Solar | 32000 | SentencePiece BPE |
| Expanded SOLAR-KO-10.7B | 46592 | SentencePiece BPE; added Korean vocab and merges |
Tokenizing "안녕하세요, 오늘은 날씨가 좋네요." ("Hello, the weather is nice today.")
- SOLAR-10.7B: 26 tokens
- SOLAR-KO-10.7B: 8 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.'] |
| SOLAR-KO-10.7B | ['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.'] |
Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"
- SOLAR-10.7B: 22 tokens
- SOLAR-KO-10.7B: 22 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
| SOLAR-KO-10.7B | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
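The token counts in the two examples above can be reproduced with the published tokenizers. The sketch below assumes both tokenizers load through `transformers`' `AutoTokenizer`; it is illustrative, not taken from the original README.

```python
# Sketch: compare the original SOLAR tokenizer and the expanded SOLAR-KO
# tokenizer on the two example sentences above.
from transformers import AutoTokenizer

original = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
expanded = AutoTokenizer.from_pretrained("beomi/SOLAR-KO-10.7B")

sentences = [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]

for sentence in sentences:
    for name, tok in [("SOLAR-10.7B", original), ("SOLAR-KO-10.7B", expanded)]:
        tokens = tok.tokenize(sentence)  # subword pieces, without special tokens
        print(f"{name}: {len(tokens)} tokens -> {tokens}")
```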
Model Benchmark
LM Eval Harness - Korean (polyglot branch)
- Used EleutherAI's lm-evaluation-harness (polyglot branch): https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot. Scores are reported by the number of few-shot examples (0, 5, 10, 50); a reproduction sketch follows the table.
| Task (metric) | 0-shot | 5-shot | 10-shot | 50-shot |
|---|---|---|---|---|
| kobest_boolq (macro_f1) | 0.853949 | 0.88098 | 0.898139 | 0.902354 |
| kobest_copa (macro_f1) | 0.804531 | 0.826736 | 0.837656 | 0.860899 |
| kobest_hellaswag (macro_f1) | 0.507174 | 0.500983 | 0.487287 | 0.512182 |
| kobest_sentineg (macro_f1) | 0.3517 | 0.972291 | 0.977321 | 0.984884 |
| kohatespeech (macro_f1) | 0.258111 | 0.403957 | 0.386808 | 0.462393 |
| kohatespeech_apeach (macro_f1) | 0.337667 | 0.651697 | 0.705337 | 0.827757 |
| kohatespeech_gen_bias (macro_f1) | 0.124535 | 0.503464 | 0.498501 | 0.443218 |
| korunsmile (f1) | 0.3814 | 0.356939 | 0.369989 | 0.296193 |
| nsmc (acc) | 0.5356 | 0.87162 | 0.88654 | 0.89632 |
| pawsx_ko (acc) | 0.5435 | 0.5245 | 0.5315 | 0.5385 |
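The sketch below outlines how such scores might be reproduced with the polyglot branch of the harness. The `hf-causal` adapter, the `evaluator.simple_evaluate` entry point, and the task names follow the 0.3-era lm-evaluation-harness API that the branch is based on; exact argument names and available tasks may differ on that branch, so treat this as an unverified outline rather than the authors' command.

```python
# Outline (unverified): score SOLAR-KO-10.7B on Korean tasks using the
# polyglot branch of EleutherAI's lm-evaluation-harness.
# https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                              # Hugging Face causal-LM adapter
    model_args="pretrained=beomi/SOLAR-KO-10.7B",
    tasks=["kobest_boolq", "kobest_copa", "nsmc"],  # subset of the tasks above
    num_fewshot=5,                                  # corresponds to the 5-shot column
)

for task, metrics in results["results"].items():
    print(task, metrics)
```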
Citation
@misc{solar_ko_junbum_2023,
  author = {{L. Junbum}},
  title = {Solar-Ko-10.7b},
  year = {2024},
  url = {https://huggingface.co/beomi/SOLAR-KO-10.7B},
  publisher = {Hugging Face}
}
Acknowledgements
License
Apache 2.0