# Llama3.1 Korean Model
This model is developed based on Llama3.1 Instruct and aims to be applicable to the Korean language and to various aspects of Korean culture. It leverages self-produced Korean data from 53 domains to understand Korean social values and culture. Thanks to ktds.
## Documentation
### Model Overview
This model is built on Llama3.1 Instruct as its foundation model and is developed to be applicable to the Korean language and various aspects of Korean culture. By leveraging self-produced Korean data from 53 domains, it understands Korean social values and culture.
### Training Data
- This model is trained on a self-developed dataset with a total size of 3.6 GB, comprising about 2.33 million examples spanning Q&A, summarization, and classification. Of these, 1.33 million are multiple-choice questions from 53 domains, including Korean history, society, finance, law, taxation, mathematics, biology, physics, and chemistry, trained in a Chain-of-Thought manner. In addition, 1.3 million subjective (open-ended) questions are trained across 38 domains such as Korean history, finance, law, taxation, and mathematics. Through this data, the model learns to understand Korean social values and human emotions and to generate outputs that follow the given instructions.
- Training instruction dataset format:
{"prompt": "prompt text", "completion": "ideal generated text"}
### Use Cases
This model can be used in a variety of application fields. For example:
- Education: Generating Q&A and explanations for various learning materials in history, mathematics, science, etc.
- Business: Providing answers to legal, financial, and tax-related inquiries and summarizing documents (see the summarization sketch after this list).
- Research and Culture: Natural language processing tasks tailored to Korean society and culture, sentiment analysis, document generation, and translation.
- Customer Service: Generating conversations with users and providing customized responses.
This model has high applicability in various natural language processing tasks.
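As one concrete illustration of the business use case above, here is a minimal sketch that requests a summary through the `transformers` text-generation pipeline. The Korean instruction wording is an assumption for illustration, not an officially recommended prompt template.

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="SEOKDONG/llama3.1_korean_v0.1_sft_by_aidx",
)

# "Summarize the following document in three sentences:" -- the instruction
# wording is an assumption, not a prompt template from the model card.
document = "..."  # replace with the document text to summarize
prompt = "다음 문서를 세 문장으로 요약해 줘:\n" + document
print(generator(prompt, max_new_tokens=256, do_sample=False)[0]["generated_text"])
```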
### Limitations
Although this model is specialized for the Korean language and Korean culture, its responses about other languages and cultures may be less accurate because data is lacking in certain domains (e.g., the latest international materials and specialized professional fields). It may also show limited reasoning ability on problems that require complex logical thinking, and it may generate biased responses if the training data contains biases.
### Usage Examples
#### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load with AutoModelForCausalLM so that generate() is available.
tokenizer = AutoTokenizer.from_pretrained("SEOKDONG/llama3.1_korean_v0.1_sft_by_aidx")
model = AutoModelForCausalLM.from_pretrained("SEOKDONG/llama3.1_korean_v0.1_sft_by_aidx")

# "Please give a judgment with reference to Article 44 of the National Health
# Insurance Act, Article 19 of its Enforcement Decree, Article 5 of the Act on
# the Regulation of Terms and Conditions, and Article 54 of the Commercial Act."
input_text = """「국민건강보험법」제44조, 「국민건강보험법 시행령」제19조, 「약관의 규제에 관한 법률」제5조, 「상법」제54조 참조 판단 해줘"""

inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=1024, temperature=0.5, do_sample=True, repetition_penalty=1.15)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
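Because the base model is Llama3.1 Instruct, prompts can also be built with the tokenizer's chat template. The sketch below assumes the bundled template follows the standard Llama 3.1 chat format; the system message and the example question are illustrative assumptions, not part of the model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "SEOKDONG/llama3.1_korean_v0.1_sft_by_aidx"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# "Explain Seollal, one of Korea's traditional holidays."
messages = [
    {"role": "system", "content": "You are a helpful assistant fluent in Korean."},  # assumed system prompt
    {"role": "user", "content": "한국의 전통 명절 중 설날에 대해 설명해 줘."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=512, temperature=0.5, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```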
## License
This model is released under the Apache-2.0 license.
## Information Table
| Property | Details |
|----------|---------|
| Model Type | Text Generation |
| Base Model | meta-llama/Llama-3.1-8B-Instruct |
| Training Datasets | AIDX-ktds/ko_leaderboard |
| Tags | ktds, ko, ko_leaderboard, korean |