Kogpt-J-350m: Open-Source Korean Text Generation Model


Developed by heegyu

A Korean text generation model based on the GPT-J architecture with 350 million parameters, suitable for various Korean text generation tasks.

Large Language Model | Korean | Open Source | License: MIT | #Korean text generation #Multi-domain corpus training #GPT-J architecture

Downloads 123

Release Time: 12/28/2022

Model Overview

This is a 350M-parameter Korean text generation model based on the GPT-J architecture. It was trained on roughly 7B tokens of Korean text and continues a prompt in the register of that prompt, such as spoken dialogue or news prose.

Model Features

Korean text generation across registers

The model produces Korean text in several registers, including spoken dialogue, news prose, and wiki-style writing, as the sample outputs on this page show.

Large-scale training data

The model was trained on diverse Korean datasets, including news, dialogues, and Wikipedia, totaling approximately 7 billion tokens.

Training environment

The model was trained on a TPU v2-8 with a learning rate of 3e-4, a batch size of 512, a linear scheduler, and 1,000 warmup steps.

Model Capabilities

Korean text generation

Dialogue generation

News-style text generation

Use Cases

Dialogue systems

Daily dialogue generation

The model can generate natural and fluent daily dialogue text.

Example output: '안녕하세요?\n네.\n자~ 오늘 그~ 뭐~ 남북정상회담에서 인제 남북 관계와 관련된 발언이죠?'

News generation

News-style text generation

The model can continue a news-style prompt into article-like text.

Example output: '오늘 정부 발표에 따르면, gtx-d d 노선을 창릉과 수서에서 출발하는 등 당초 예정된 노선들을 모두 정차하기로 했다.'

🚀 Kogpt-J-350m Model

Kogpt-J-350m is a text generation model based on the GPT-J architecture, trained on a range of Korean datasets to generate Korean text.

🚀 Quick Start

You can use the following code to quickly start using the model:

from transformers import pipeline

# Load the model from the Hugging Face Hub as a text-generation pipeline.
model_name = "heegyu/kogpt-j-350m"
pipe = pipeline('text-generation', model=model_name)

# Generation settings shared by all three prompts.
gen_kwargs = dict(repetition_penalty=1.2, do_sample=True, eos_token_id=1, early_stopping=True, max_new_tokens=128)

print(pipe("안녕하세요", **gen_kwargs))
print(pipe("오늘 정부 발표에 따르면, ", **gen_kwargs))
# min_length=64 requires at least 64 tokens in total (prompt included) before generation may stop.
print(pipe("싸늘하다. 가슴에 비수가 날아와 꽂힌다. ", **gen_kwargs, min_length=64))

✨ Features

Model Configuration

GPT-J (Flax, PyTorch)
20 layers, 1024 hidden dim, 4096 intermediate dim, 16 attention heads, 51200 vocab size
Max sequence length: 1024
Number of parameters: 350M
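As a rough sanity check on the 350M figure, the configuration above can be turned into a parameter count. This is only a sketch: it assumes the usual GPT-J layout (bias-free attention projections, one LayerNorm per block, rotary position encoding with no learned position table, and an untied output head with bias). The function name `gptj_param_count` is hypothetical, and the actual checkpoint may differ slightly.

```python
def gptj_param_count(n_layer, n_embd, n_inner, vocab_size):
    """Rough GPT-J parameter count under the assumed layout described above."""
    # Attention: q, k, v and output projections, assumed to have no bias.
    attn = 4 * n_embd * n_embd
    # Feed-forward: two linear layers, each with a bias vector.
    mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)
    # One LayerNorm per block (weight + bias), assumed shared by attention and MLP.
    ln = 2 * n_embd
    per_block = attn + mlp + ln

    embed = vocab_size * n_embd                  # input token embeddings
    final_ln = 2 * n_embd                        # final LayerNorm
    lm_head = n_embd * vocab_size + vocab_size   # untied output head with bias (assumption)
    return n_layer * per_block + embed + final_ln + lm_head


total = gptj_param_count(n_layer=20, n_embd=1024, n_inner=4096, vocab_size=51200)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count comes to about 357M, which is consistent with the rounded 350M stated above. Note that the number of attention heads does not affect the count, since it only changes how the hidden dimension is split.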

Performance Benchmark

🔧 Technical Details

Training Environment and Hyperparameters

TPU v2-8
Learning rate: 3e-4, batch size: 512 (= 64 accum x 8 devices), scheduler: linear, warmup: 1,000 steps
adam_beta1 = 0.9, adam_beta2 = 0.98, weight_decay = 0.01
Training steps: 43,247 (3 epochs)
Number of training tokens: 21.11B (43247 steps * 512 sequences * 1024 tokens per sequence / 1024^3)
Training period: 2023/1/25 ~ 2023/1/29
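The learning-rate settings above can be illustrated with a small schedule function. This is a sketch under assumptions, not the training script: it assumes "linear" means the common pattern of a linear warmup from 0 to the peak rate, followed by a linear decay to 0 at the final step. The function name `linear_schedule_lr` is hypothetical.

```python
PEAK_LR = 3e-4        # learning rate from the list above
WARMUP_STEPS = 1000   # warmup steps from the list above
TOTAL_STEPS = 43247   # training steps from the list above


def linear_schedule_lr(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at `step`, assuming linear warmup then linear decay to zero."""
    if step < warmup:
        # Warmup phase: ramp linearly from 0 up to the peak rate.
        return peak_lr * step / warmup
    # Decay phase: ramp linearly from the peak rate down to 0 at the last step.
    remaining = max(0, total - step)
    return peak_lr * remaining / (total - warmup)


for step in (0, 500, 1000, 22000, 43247):
    print(f"step {step:>6}: lr = {linear_schedule_lr(step):.2e}")
```

With these assumptions the rate peaks at 3e-4 on step 1,000, so roughly the first 2% of training is spent warming up.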

Training Datasets

AIHub SNS Conversations (730MB)
AIHub Spoken Language (422MB)
AIHub Books (1.6MB)
AIHub Large-scale Web-based Korean Corpus (12GB)
Korean Wikipedia (867MB)
NamuWiki (6.4GB)
National Institute of Korean Language Messenger Conversations (21MB)
National Institute of Korean Language Daily Conversation Corpus (23MB)
National Institute of Korean Language Written Language Corpus (3.2GB)
National Institute of Korean Language Spoken Language Corpus (1.1GB)
National Institute of Korean Language Newspaper Corpus (~2022, 17GB)
Blue House Citizen Petitions (525MB)

Dataset sizes are measured on the pre-processed jsonl files. The total is approximately 7B tokens.
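The figures above can be cross-checked with a few lines of arithmetic. The sketch below sums the listed file sizes and relates the 7B-token corpus to the training-token count given under the hyperparameters. It assumes 1 GB = 1000 MB, and "NIKL" is used as shorthand for the National Institute of Korean Language; both are labeling choices for this example only.

```python
# Sizes of the pre-processed jsonl files as listed above, in MB (assuming 1 GB = 1000 MB).
sizes_mb = {
    "AIHub SNS Conversations": 730,
    "AIHub Spoken Language": 422,
    "AIHub Books": 1.6,
    "AIHub Large-scale Web-based Korean Corpus": 12_000,
    "Korean Wikipedia": 867,
    "NamuWiki": 6_400,
    "NIKL Messenger Conversations": 21,
    "NIKL Daily Conversation Corpus": 23,
    "NIKL Written Language Corpus": 3_200,
    "NIKL Spoken Language Corpus": 1_100,
    "NIKL Newspaper Corpus": 17_000,
    "Blue House Citizen Petitions": 525,
}
total_mb = sum(sizes_mb.values())

# Training-token arithmetic, using the step count, batch size and sequence length
# stated in the training hyperparameters.
steps, batch_size, seq_len, epochs = 43247, 512, 1024, 3
tokens_seen = steps * batch_size * seq_len
tokens_seen_gi = tokens_seen / 1024**3          # same 1024^3 unit as the source formula
tokens_per_epoch_gi = tokens_seen_gi / epochs   # tokens in one pass over the corpus

print(f"corpus size: {total_mb / 1000:.1f} GB")
print(f"tokens seen: {tokens_seen:,} ({tokens_seen_gi:.2f} x 1024^3)")
print(f"tokens per epoch: {tokens_per_epoch_gi:.2f} x 1024^3")
```

The per-epoch figure comes out at about 7 in units of 1024^3 tokens, which matches the stated total of approximately 7B tokens over 3 epochs. The stated 21.11B appears to be the same quotient truncated rather than rounded.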

💻 Usage Examples

Basic Usage

The Quick Start code above produced the following outputs. Because sampling is enabled (do_sample=True), results differ between runs.

[{'generated_text': '안녕하세요?\n네.\n자~ 오늘 그~ 뭐~ 남북정상회담에서 인제 남북 관계와 관련된 발언이죠?\n예. 그렇습니다.\n어~ 그~ 이산가족 문제 관련해서 이산가족 상봉을\n예.\n하는 방안이 좀 가능성이 있지 않아요?\n상당히 가능성이 있죠.\n예. 이~ 구체적으로 어떤 거였나요?\n어~ 먼저 이산가족 상봉을 이제 말씀드리겠습니다.\n예.\n아까 설명드린 것처럼 그~ 이산가족 상\n네.\n그~ 상봉에 대한 그~ 구체적인 방안이 어떻게 결정되는 게 가장 좋을까요?\n우선 상봉 방법부터 얘기를 드리죠.\n'}]
[{'generated_text': '오늘 정부 발표에 따르면, gtx-d d 노선을 창릉과 수서에서 출발하는 등 당초 예정된 노선들을 모두 정차하기로 했다. 지난 2월 국토교통부가 이 노선을 일산·금정·파주 운정역과 직접 연결키로 하면서 일산~동탄, 일산~분당, 일산~양재 구간에 추가 정차할 것이라는 예상이 나왔지만 실제 일산~수서 구간이 정차하기로 확정됐다. gtx-d 노선이 일산~수서역까지 개통되는 것은 이번이 처음이다.. gtx-d 노선과 gtx-a 노선이 모두 개통되면 지하철 5호선의 서울 도심 통과 구간이 추가된다. 현재 gtx-b'}]
[{'generated_text': '싸늘하다. 가슴에 비수가 날아와 꽂힌다. \U000f0854삼국사절요\U000f0855 ‘화살촉이 울버린’의 경우에서 보면, 총소리의 원음은 鐘(종자용 : 송악), 鐘을 비(鐘)라 하고 종자의 발음은 ‘이( )’이다. 이때에서 ‘이(은)로 시작하는 발음’은 ‘이/이’의 음운적 표현이다. ‘이/은→종자용[鐘] → 송악/종자[鐘]→이→종자(鐘) …’이다. 이는 한자어로서 그 발음'}]
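Each result above is a list of dicts whose 'generated_text' value contains the prompt followed by the continuation. A minimal sketch of post-processing such a result is shown below. The helper names `split_continuation` and `dialogue_turns` are hypothetical, and the sample is a shortened copy of the first result so that the example runs without downloading the model.

```python
def split_continuation(result, prompt):
    """Return only the newly generated text from one pipeline result dict."""
    text = result["generated_text"]
    # The results above start with the prompt, so strip it when present.
    return text[len(prompt):] if text.startswith(prompt) else text


def dialogue_turns(text):
    """Split dialogue-style output into non-empty turns, one per line."""
    return [line.strip() for line in text.split("\n") if line.strip()]


# Shortened copy of the first result shown above.
sample = [{"generated_text": "안녕하세요?\n네.\n자~ 오늘 그~ 뭐~ 남북정상회담에서 인제 남북 관계와 관련된 발언이죠?\n예. 그렇습니다.\n"}]
prompt = "안녕하세요"

continuation = split_continuation(sample[0], prompt)
turns = dialogue_turns(sample[0]["generated_text"])

print(repr(continuation))
for i, turn in enumerate(turns, start=1):
    print(i, turn)
```

Splitting on newlines suits the dialogue-style output, where each line reads as one speaker turn. News-style output like the second result is a single paragraph and needs no splitting.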

📄 License

This project is licensed under the MIT license.

⚠️ Important Note

The training data may contain discriminatory or hateful content, and no filtering was applied to remove it. Text generated by the model may therefore include discriminatory or hateful remarks targeting specific individuals, races, genders, or disabilities.
