KTDSbaseLM v0.11
KTDSbaseLM v0.11 is a model built on OpenChat 3.5. It is fine-tuned on self-produced Korean data so that it understands the Korean language and Korean cultural contexts and reflects the values and culture of Korean society.
Features
- Model Name and Key Features: KTDSbaseLM v0.11 is a Mistral 7B model, fine-tuned with the SFT method from OpenChat 3.5. It is designed to understand Korean and various cultural contexts in Korea, and it uses self-produced Korean data from 135 domains to reflect the values and culture of Korean society. Its main functions include text generation, conversation inference, document summarization, question answering, sentiment analysis, and other natural language processing tasks. It can be applied in fields such as law, finance, science, education, business, and cultural research.
- Model Architecture: KTDSbaseLM v0.11 is a language model with 7 billion parameters, based on Mistral 7B. With OpenChat 3.5 as the foundation model, it is trained through SFT (Supervised Fine-Tuning) to perform well on Korean language and culture. The comparatively lightweight Mistral 7B architecture gives fast inference and good memory efficiency, and the model is tuned for natural language processing tasks such as text generation, question answering, document summarization, and sentiment analysis.
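As a rough illustration of the memory-efficiency point above, the footprint of the weights of a 7-billion-parameter model can be estimated from the number of bits stored per parameter. The sketch below is arithmetic only. The function name `weight_memory_gb` and the list of precisions are illustrative assumptions rather than part of the model card, and the estimate ignores activations, the KV cache, and quantization overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # "7 billion parameters", as stated in the model card

# Illustrative precisions (assumption); the card does not say which one is used.
for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, halving the bits per parameter halves the weight memory, which is why quantized loading is a common choice on a single GPU.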
Usage Examples
Basic Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = "auto"  # let Accelerate place the model on the available devices
model_id = ""  # left empty in the original; set to the model's repository ID or local path
# Assumption: base_LLM_model was undefined in the original; here the tokenizer
# is loaded from the same location as the model.
base_LLM_model = model_id

# Assumption: bnb_config was undefined in the original; 4-bit quantization is
# shown as one possible setting (requires the bitsandbytes package).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained(base_LLM_model)

input_text = "안녕하세요."  # "Hello." in Korean
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to(model.device)  # move the inputs to the device holding the model

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=1024)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
Documentation
- Training Data: KTDSbaseLM v0.11 was trained on 3.6 GB of self-developed data comprising 2.33 million examples of Q&A, summarization, classification, and similar tasks. Of these, 1.33 million are multiple-choice questions from 53 domains (including Korean history, society, finance, law, taxation, mathematics, biology, physics, and chemistry), trained with the Chain of Thought method. In addition, 1.3 million subjective questions cover 38 domains such as Korean history, finance, law, taxation, and mathematics. The data is intended to teach the model Korean social values and human emotions and to have it generate outputs that follow the given instructions.
- Training Instruction Datasets Format:
{"prompt": "prompt text", "completion": "ideal generated text"}
- Use Cases: KTDSbaseLM v0.11 can be used in a range of application fields. For example:
  - Education: Answering questions and generating explanations for learning materials in history, math, science, and other subjects.
  - Business: Answering legal, financial, and tax-related questions and summarizing documents.
  - Research and Culture: Performing natural language processing tasks, sentiment analysis, document generation, and translation suited to Korean society and culture.
  - Customer Service: Conversing with users and providing personalized responses.
- Limitations: KTDSbaseLM v0.11 is specialized in the Korean language and Korean culture. Because its training data is sparse in some areas (e.g., the latest international materials and specialized professional fields), responses concerning other languages or cultures may be less accurate. It may also show limited reasoning ability on problems that require complex logical thinking, and it may generate biased responses if the training data contains biases.
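The `prompt`/`completion` record format listed above maps naturally onto a JSON Lines file with one record per line. The following is a minimal sketch under that assumption: the original does not say how the records are stored, and the helper names `parse_record` and `to_jsonl` and the sample record are hypothetical.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}


def parse_record(line: str) -> dict:
    """Parse one JSON line and check that 'prompt' and 'completion' are strings."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key in REQUIRED_KEYS:
        if not isinstance(record[key], str):
            raise ValueError(f"{key!r} must be a string")
    return record


def to_jsonl(records: list) -> str:
    """Serialize records one per line; ensure_ascii=False keeps Korean text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical sample record, reusing the placeholder text from the format line.
sample = [{"prompt": "prompt text", "completion": "ideal generated text"}]
text = to_jsonl(sample)
parsed = [parse_record(line) for line in text.splitlines()]
print(parsed[0]["completion"])
```

Validating each line before training surfaces malformed records early instead of partway through a fine-tuning run.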
License
The model is released under the Apache-2.0 license.
Additional Information
KTDS plans to provide fine-tuned LLMs (Large Language Models) across various domains of Korean culture and knowledge, based not only on OpenChat but also on LLaMA, Polyglot, and EEVE. These models will be tailored to better understand and generate content specific to Korean contexts.
| Property | Details |
|---|---|
| Base Model | openchat/openchat_3.5 |
| Language | Korean |
| License | apache-2.0 |
| Pipeline Tag | text-generation |
| Datasets | AIDX-ktds/ko_leaderboard |