🚀 RAG-Specialized-LLM
This model is a full fine-tune of the Qwen2.5 14B model on self-built RAG-specific datasets, CoT datasets, and benchmark datasets (constructed with the Command R+ model). Given the input of a typical RAG service, it generates accurate answers together with their sources and returns the result in JSON format.
🚀 Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Surromind/RAG-Specialized-LLM"

# Load the fine-tuned model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = """ Question: Apart from NAOG, which Mongolian people have participated in the training programs of the Local Administration Training Institute of the Ministry of the Interior?\n\n doc_id : 44365 / title : Korea's local talent development strategy is transferred to Vietnam. / content : Korea's local talent development strategy is transferred to Vietnam.\n□ Local government officials from Lang Son Province, Vietnam, came to Korea to learn about Korea's human resource development strategies such as the civil servant recruitment and education training system. \n○ The Local Administration Training Institute of the Ministry of the Interior (Director ***, hereinafter referred to as the Training Institute) will jointly operate the "Capacity Building for Vietnamese Local Government Officials from Lan Son Province" program with the Korea International Cooperation Agency (KOICA) for two weeks from November 26th to December 9th for 15 local government officials from Lang Son Province, Vietnam. \n□ Since most of the trainees are local government officials from Lang Son Province, Vietnam, this program is designed as a customized training program including lectures on local administration, civil servant recruitment and education, and local economic activation, as well as on-site visits, as requested by the local government. \n○ In particular, in order to strengthen the leadership and capabilities of local government officials in Vietnam, trainees will be required to establish an Action Plan through a discussion-style seminar on the civil servant recruitment and education training system, so that they can apply it to the formulation of human resource development policies in Lang Son Province. \n○ In addition, the training group will visit the Geumju-gun Base Farmer Processing Center and the Local Economic Circulation Center, which are evaluated as successful cases of increasing agricultural income and activating the local economy, to see the on-site agricultural product processing system that supports the stable sales of agricultural products produced by local farmers through secondary and tertiary food processing. \n○ In addition, trainees will have the opportunity to visit the Incheon Free Economic Zone Authority, which is of great interest to Lang Son Province, Vietnam, and experience how it can be applied to the local economy of Lang Son Province while visiting the on-site economic development situation in Korea. \n□ On the other hand, since 2006, the Training Institute has been operating training programs for local government officials in Vietnam. After 5 Vietnam programs and other multinational programs, a total of 130 trainees have graduated. doc_id : 45112 / title : "We came to learn about the innovation cases of Korean public enterprises!" / content : A delegation of professors and senior officials from the National Academy of Governance (NAOG) of Mongolia visited Korea to "learn about the innovation cases of Korean public enterprises!" - The Local Administration Training Institute has been implementing customized education for Mongolia for the 13th year. \n□ The Local Administration Training Institute of the Ministry of the Interior (Director Choi Doo-young, hereinafter referred to as the Training Institute) will operate the "Mongolian NAOG* Capacity Building Program" from March 1st to March 8th. \n○ 14 people including professors, senior officials, and training-related officials will participate in this program. 
\n* NAOG (National Academy of Governance): The largest educational institution in Mongolia that educates opinion leaders in Mongolia, including civil servants, politicians, and civilians, and awards master's and doctoral degrees. \n□ After signing a MOU on exchange and cooperation with the Mongolian NAOG in 2002, the Training Institute has operated 13 training programs (such as administrative reform, economic development strategies, and measures to improve administrative transparency), graduating 158 NAOG professors and senior officials as alumni. \n○ In addition, the Training Institute has been operating various training programs such as the Mongolian Governor Training Program, which allows 1,310 local government officials such as governors and county heads in Mongolia to benchmark the excellent cases of Korean local administration."""
messages = [
{
"role": "system",
"content": """As an interactive AI, your main role is to provide reliable information in response to users' questions. You need to accurately understand users' requirements and analyze relevant documents to generate the best answers. \nYou must follow the following principles:\n1. Always prioritize users' requests and provide clear and easy-to-understand answers. \n2. Make the most of the provided documents to construct responses, but improve the quality of responses through additional analysis and logic. \n3. When generating responses, you must follow the given instructions and provide clear sources. \n4. If users' questions are ambiguous, you can consider rephrasing the questions to ensure clarity. \n\n# User Guide\n## Tasks and Context\nYou need to analyze relevant documents in response to users' questions and generate responses based on reliable information. It is important to provide information in the most appropriate form considering the context, rather than simply conveying information. \n\n## Style Guide\nPlease output your answers in JSON format. [{"related_document": {"doc_id found in document information"}, "source": {"doc_id found in document information": "Quoted sentences found in the corresponding document, presented as the original text", "doc_id found in document information": "Quoted sentences found in the corresponding document, presented as the original text"}, "answer": "A descriptive answer of 3 - 6 sentences without indicating the source", "grounded_answer": "The same as the answer, but with the citation source indicated by the <co: doc_id> and </co: doc_id> symbols"}]\n""",
},
{"role": "user", "content": prompt},
]
# Build the chat prompt from the system and user messages
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the answer and strip the prompt tokens from the output
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
✨ Features
This model is fine-tuned from Qwen2.5 14B on self-built RAG-specific datasets, CoT datasets, and benchmark datasets. Given the input of a typical RAG service, it generates accurate answers together with their sources and returns the result in JSON format. The output keys are as follows:
- "related_document": the doc_id and title of each document related to the question (key: doc_id, value: document title).
- "source": for each related document, the sentences quoted in the answer (key: doc_id, value: quoted sentences from the original document).
- "answer": a descriptive answer of 3-6 sentences without source indications.
- "grounded_answer": the same answer, with citation sources marked using the <co: doc_id> and </co: doc_id> tags.
Example of answer output:
```json
{
"related_document": {
"D0000042284685": "Measures to promote fire prevention for electric tricycles in Garak Market",
"4895": "Next-generation high-reliability and high-output supercapacitors"
},
"source": {
"D0000042284685": "「Charging devices for logistics and transportation equipment (lithium-ion batteries) ...",
"4895": "Comparison between supercapacitors and lithium secondary batteries ..."
},
"answer": "The lithium-ion batteries and supercapacitors of electric tricycles in Garak Market have differences in...",
"grounded_answer": "The lithium-ion batteries and supercapacitors of electric tricycles in Garak Market have differences in <co: 4895>mechanism, materials, lifespan, protection circuits, polarity, overvoltage, residual capacity measurement, characteristics</co: 4895>, etc. Lithium-ion batteries have a <co: 4895>lithium-ion movement mechanism</co: 4895>..."
}
```
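Since the model returns its answer as a JSON string, downstream code will typically parse it and, if needed, collect the documents cited in grounded_answer. The snippet below is a minimal post-processing sketch using the `response` string from the Quick Start; the helper name is hypothetical, and the regular expression for the <co: doc_id> tags is inferred from the example above rather than specified by the card.

```python
import json
import re

def parse_rag_response(response: str):
    """Parse the model's JSON output and collect the doc_ids cited in grounded_answer."""
    data = json.loads(response)
    # The style guide wraps the object in a list; unwrap it if so.
    if isinstance(data, list):
        data = data[0]
    # Opening citation tags look like "<co: 4895>" in the example above (assumed format).
    cited_ids = set(re.findall(r"<co:\s*([^>]+)>", data.get("grounded_answer", "")))
    return data["answer"], data.get("source", {}), cited_ids

answer, sources, cited_ids = parse_rag_response(response)
print(answer)
print("Cited documents:", cited_ids)
```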
📚 Documentation
RAG Prompt
RAG_PROMPT = """<|im_start|>system\n\n As an interactive AI, your main role is to provide reliable information in response to users' questions. You need to accurately understand users' requirements and analyze relevant documents to generate the best answers. \nYou must follow the following principles:\n1. Always prioritize users' requests and provide clear and easy-to-understand answers. \n2. Make the most of the provided documents to construct responses, but improve the quality of responses through additional analysis and logic. \n3. When generating responses, you must follow the given instructions and provide clear sources. \n4. If users' questions are ambiguous, you can consider rephrasing the questions to ensure clarity. \n\n# User Guide\n## Tasks and Context\nYou need to analyze relevant documents in response to users' questions and generate responses based on reliable information. It is important to provide information in the most appropriate form considering the context, rather than simply conveying information. \n\n## Style Guide\nPlease output your answers in JSON format. [{"related_document": {"doc_id found in document information"}, "source": {"doc_id found in document information": "Quoted sentences found in the corresponding document, presented as the original text", "doc_id found in document information": "Quoted sentences found in the corresponding document, presented as the original text"}, "answer": "A descriptive answer of 3 - 6 sentences without indicating the source", "grounded_answer": "The same as the answer, but with the citation source indicated by the <co: doc_id> and </co: doc_id> symbols"}]\n<|im_end|>\n<|im_start|>user\n {instruction} <|im_end|>\n<|im_start|>assistant\n"""
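The {instruction} placeholder holds the user question followed by the retrieved documents. The card does not pin down that serialization beyond the Quick Start example, so the helper below is a sketch that mirrors the `doc_id : ... / title : ... / content : ...` layout shown there; the function name and the keys of the input dicts are assumptions, and the document content is truncated for brevity. Note that `str.format` would trip over the JSON braces inside the template, so plain string replacement is used instead. `tokenizer` and `model` are assumed to be loaded as in the Quick Start.

```python
def build_instruction(question: str, documents: list[dict]) -> str:
    """Serialize a question plus retrieved documents in the layout used by the Quick Start prompt."""
    doc_blocks = [
        f"doc_id : {doc['doc_id']} / title : {doc['title']} / content : {doc['content']}"
        for doc in documents
    ]
    return f"Question: {question}\n\n " + " ".join(doc_blocks)

question = "Apart from NAOG, which Mongolian people have participated in the training programs?"
retrieved_docs = [
    {
        "doc_id": 45112,
        "title": "We came to learn about the innovation cases of Korean public enterprises!",
        "content": "A delegation of professors and senior officials from the National Academy of Governance (NAOG) of Mongolia visited Korea ...",
    },
]

# Fill the template with str.replace (str.format would choke on the JSON braces in the style guide).
rag_input = RAG_PROMPT.replace("{instruction}", build_instruction(question, retrieved_docs))
model_inputs = tokenizer(rag_input, return_tensors="pt").to(model.device)
```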
Training Environment and Parameters
- Tuning environment:
- H100 (80GB) * 8
- Parameters (restated as a TrainingArguments sketch after this list):
- tokenizer_model_max_length: 4500
- use_flash_attn: True
- num_train_epochs: 3.0
- weight_decay: 0.001
- lr_scheduler_type: "linear"
- per_device_train_batch_size: 1
- gradient_accumulation_steps: 64
- learning_rate: 5e-06
- bf16: True
- deepspeed: ds_stage2.json
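The training code itself is not published. As a rough illustration only, the hyperparameters above map onto Hugging Face `TrainingArguments` as sketched below; the output directory is a placeholder, and `tokenizer_model_max_length` / `use_flash_attn` are tokenizer- and model-loading settings rather than `TrainingArguments` fields, so they are omitted.

```python
from transformers import TrainingArguments

# Hypothetical restatement of the listed hyperparameters; not the authors' actual script.
training_args = TrainingArguments(
    output_dir="rag-specialized-llm-ft",  # placeholder, not specified in the card
    num_train_epochs=3.0,
    weight_decay=0.001,
    lr_scheduler_type="linear",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    learning_rate=5e-6,
    bf16=True,
    deepspeed="ds_stage2.json",
)
```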
Datasets Used
- AIhub 16 administrative document-based machine reading comprehension data
- AIhub 17 news article machine reading comprehension data
- AIhub 21 book material machine reading comprehension
- AIhub 149 table information Q&A data
- AIhub 150 numerical operation machine reading comprehension data
- AIhub 151 financial and legal document machine reading comprehension data
- kyujinpy/KoCoT_2000
- MarkrAI/KoCommercial-Dataset
- CarrotAI/ko-instruction-dataset
- heegyu/CoT-collection-ko
Contact Us
- Surromind
- 2nd Floor, 1802 Nambusunhwan-ro, Gwanak-gu, Seoul
- 02-872-5127
- contact@surromind.ai
🔧 Technical Details
The model is based on Qwen/Qwen2.5-14B and is fully fine-tuned on self-built RAG-specific datasets, CoT datasets, and benchmark datasets, using the training environment and hyperparameters listed above, so that it can generate accurate answers with their sources for general RAG inputs and return them in JSON format.
📄 License
The model is licensed under the Apache-2.0 license.
| Property | Details |
|---|---|
| Model Type | RAG-Specialized-LLM |
| Base Model | Qwen/Qwen2.5-14B |
| Tags | RAG, Ko-LLM, QA |
| Datasets | kyujinpy/KoCoT_2000, MarkrAI/KoCommercial-Dataset, CarrotAI/ko-instruction-dataset, heegyu/CoT-collection-ko, AIhub 16 administrative document-based machine reading comprehension data, AIhub 17 news article machine reading comprehension data, AIhub 21 book material machine reading comprehension, AIhub 149 table information Q&A data, AIhub 150 numerical operation machine reading comprehension data, AIhub 151 financial and legal document machine reading comprehension data |
| Pipeline Tag | text-generation |
| License | Apache-2.0 |

