ktdsbaseLM-v0.12 Open-Source Korean Large Language Model - Accurately Understand Korean Culture to Boost Natural Language Processing

ktdsbaseLM v0.12 Based on OpenChat 3.5

Developed by AIDX-ktds

ktdsbaseLM v0.11 is a large Korean language model based on OpenChat 3.5, focusing on understanding the Korean language and the diverse cultures of South Korea, and is suitable for various natural language processing tasks.

Large Language Model

Safetensors

Korean · Open Source License: Apache-2.0 · #Understanding of Korean culture · #Korean social values · #Fine-tuning of Mistral 7B

Downloads 1,726

Release Time : 10/3/2024

Model Overview

This model uses self-produced Korean data to reflect the values and cultures of South Korean society and can be applied to various natural language processing tasks such as text generation, dialogue reasoning, document summarization, question answering, and sentiment analysis.

Model Features

Cultural understanding

Specifically designed for the Korean language and Korean culture, using self-produced Korean data from 135 domains to reflect the values and cultures of South Korean society.

High-performance architecture

Based on the Mistral 7B model with 7 billion parameters, using a lightweight structure to ensure fast inference speed and memory efficiency.

Multi-domain coverage

The training data comprises 2.33 million samples for tasks such as QnA, summarization, and classification, spanning domains such as South Korean history, society, finance, law, taxation, mathematics, biology, physics, and chemistry.

Model Capabilities

Text generation

Dialogue reasoning

Document summarization

Question answering system

Sentiment analysis

Multi-domain knowledge processing

Use Cases

Education field

Q&A for learning materials

Conduct Q&A and generate explanations for various learning materials such as history, mathematics, and science.

Business field

Legal and financial consultation

Provide answers and document summaries for legal, financial, and tax-related questions.

Research and cultural field

Culture-related NLP tasks

Perform natural language processing tasks, sentiment analysis, document generation, and translation in line with South Korean society and culture.

Customer service

Personalized dialogue generation

Generate dialogues with users and provide personalized responses.

🚀 KTDSbaseLM v0.11

KTDSbaseLM v0.11 is a model built on OpenChat 3.5. It aims to understand Korean and the varied cultural contexts of Korea by leveraging self-produced Korean data that reflects the values and culture of Korean society.

✨ Features

Model Name and Key Features: KTDSbaseLM v0.11 is obtained by fine-tuning OpenChat 3.5 (itself based on Mistral 7B) with the SFT method. It is designed to understand Korean and the varied cultural contexts of Korea. By using self-produced Korean data from 135 domains, it reflects the values and culture of Korean society. Its main functions include text generation, conversation inference, document summarization, question answering, sentiment analysis, and other natural language processing tasks. It can be applied in fields such as law, finance, science, education, business, and cultural research.
Model Architecture: KTDSbaseLM v0.11 is a language model with 7 billion parameters based on the Mistral 7B architecture. Using OpenChat 3.5 as the foundation model, it is trained with SFT (Supervised Fine-Tuning) to perform well on Korean language and culture. The lightweight structure of Mistral 7B provides fast inference and memory efficiency, and the model is tuned for natural language processing tasks such as text generation, question answering, document summarization, and sentiment analysis.
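To make the memory-efficiency point concrete, the following is a rough back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. It assumes exactly 7 billion parameters (the figure quoted above; the real count may differ slightly) and ignores activations, the KV cache, and framework overhead. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "7 billion parameters" as stated in the text above.
PARAMS = 7e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions, half-precision weights need roughly 14 GB and 4-bit quantized weights roughly 3.5 GB, which is why a quantized load is a common choice on a single consumer GPU.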

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = 'auto'  #@param {type: "string"}
# Hugging Face model ID or local path. Left empty in the original; set it before running.
model_id = ''  #@param {type: "string"}

# The original referenced an undefined `bnb_config`. The 4-bit settings below are
# an assumption based on the commented-out `load_in_4bit=True`; adjust as needed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the quantized model and let Accelerate place it on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map=device)

# The original loaded the tokenizer from an undefined `base_LLM_model`;
# loading it from the same model ID is assumed here.
tokenizer = AutoTokenizer.from_pretrained(model_id)

input_text = "안녕하세요."
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to(model.device)  # move inputs to the device holding the model

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=1024)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
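For a decoder-only model, `generate` returns the prompt tokens followed by the newly generated tokens, so the decoded `result` begins with the input text. The sketch below shows one way to keep only the continuation. It works on plain lists of token IDs so it runs without a model; `strip_prompt` is a hypothetical helper, not part of the Transformers API.

```python
from typing import List


def strip_prompt(output_ids: List[int], prompt_len: int) -> List[int]:
    """Return only the tokens generated after the prompt.

    output_ids: full sequence returned by generation (prompt + continuation).
    prompt_len: number of tokens in the prompt.
    """
    if prompt_len < 0 or prompt_len > len(output_ids):
        raise ValueError("prompt_len must be between 0 and len(output_ids)")
    return output_ids[prompt_len:]


# Toy IDs for illustration: the first three stand in for the prompt.
full_sequence = [101, 7592, 102, 2023, 2003, 1037]
continuation = strip_prompt(full_sequence, prompt_len=3)
print(continuation)

# With the variables from the usage example, the equivalent slice would be:
#   prompt_len = inputs["input_ids"].shape[1]
#   tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
```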

📚 Documentation

Training Data: KTDSbaseLM v0.11 was trained on 3.6 GB of self-developed data comprising 2.33 million samples for Q&A, summarization, classification, and similar tasks. Among them, 1.33 million are multiple-choice questions from 53 domains (including Korean history, society, finance, law, taxation, mathematics, biology, physics, and chemistry), trained with the Chain of Thought method. In addition, 1.3 million subjective (free-response) questions cover 38 domains such as Korean history, finance, law, taxation, and mathematics. The data was built so that the model understands Korean social values and human emotions and generates outputs that follow the given instructions.
Training Instruction Dataset Format:

{"prompt": "prompt text", "completion": "ideal generated text"}

Use Cases: KTDSbaseLM v0.11 can be used in various application fields. For example:
- Education: Answering questions and generating explanations for various learning materials in history, math, science, etc.
- Business: Providing answers to legal, financial, and tax-related questions and summarizing documents.
- Research and Culture: Performing natural language processing tasks, sentiment analysis, document generation, and translation suitable for Korean society and culture.
- Customer Service: Generating conversations with users and providing personalized responses.
Limitations: KTDSbaseLM v0.11 is specialized in the Korean language and Korean culture. Because training data is lacking in some areas (e.g., recent international material and specialized professional fields), responses about other languages or cultures may be inaccurate. The model may also show limited reasoning ability on problems that require complex logical thinking, and it may generate biased responses if the training data contains biases.

📄 License

The model is released under the Apache-2.0 license.

Additional Information

KTDS plans to provide fine-tuned LLMs (Large Language Models) across various domains of Korean culture and knowledge, including models based not only on OpenChat but also on LLaMA, Polyglot, and EEVE. These models will be tailored to better understand and generate content specific to Korean contexts.

Property	Details
Base Model	openchat/openchat_3.5
Language	Korean
License	apache-2.0
Pipeline Tag	text-generation
Datasets	AIDX-ktds/ko_leaderboard
