🚀 RAG-Specialized-LLM
This model is a full fine-tune of the Qwen2.5 14B model on self-built RAG-specific datasets, CoT datasets, and benchmark datasets (constructed with the Command R+ model). Given the input of a typical RAG service, it generates accurate answers together with their sources and returns the result in JSON format.
🚀 Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Surromind/RAG-Specialized-LLM"

# Load the fine-tuned model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = """ Question: Apart from NAOG, which Mongolian people have participated in the training programs of the Local Administration Training Institute of the Ministry of the Interior?\n\n doc_id : 44365 / title : Korea's local talent development strategy is transferred to Vietnam. / content : Korea's local talent development strategy is transferred to Vietnam.\n□ Local government officials from Lang Son Province, Vietnam, came to Korea to learn about Korea's human resource development strategies such as the civil servant recruitment and education training system. \n○ The Local Administration Training Institute of the Ministry of the Interior (Director ***, hereinafter referred to as the Training Institute) will jointly operate the "Capacity Building for Vietnamese Local Government Officials from Lan Son Province" program with the Korea International Cooperation Agency (KOICA) for two weeks from November 26th to December 9th for 15 local government officials from Lang Son Province, Vietnam. \n□ Since most of the trainees are local government officials from Lang Son Province, Vietnam, this program is designed as a customized training program including lectures on local administration, civil servant recruitment and education, and local economic activation, as well as on-site visits, as requested by the local government. \n○ In particular, in order to strengthen the leadership and capabilities of local government officials in Vietnam, trainees will be required to establish an Action Plan through a discussion-style seminar on the civil servant recruitment and education training system, so that they can apply it to the formulation of human resource development policies in Lang Son Province. \n○ In addition, the training group will visit the Geumju-gun Base Farmer Processing Center and the Local Economic Circulation Center, which are evaluated as successful cases of increasing agricultural income and activating the local economy, to see the on-site agricultural product processing system that supports the stable sales of agricultural products produced by local farmers through secondary and tertiary food processing. \n○ In addition, trainees will have the opportunity to visit the Incheon Free Economic Zone Authority, which is of great interest to Lang Son Province, Vietnam, and experience how it can be applied to the local economy of Lang Son Province while visiting the on-site economic development situation in Korea. \n□ On the other hand, since 2006, the Training Institute has been operating training programs for local government officials in Vietnam. After 5 Vietnam programs and other multinational programs, a total of 130 trainees have graduated. doc_id : 45112 / title : "We came to learn about the innovation cases of Korean public enterprises!" / content : A delegation of professors and senior officials from the National Academy of Governance (NAOG) of Mongolia visited Korea to "learn about the innovation cases of Korean public enterprises!" - The Local Administration Training Institute has been implementing customized education for Mongolia for the 13th year. \n□ The Local Administration Training Institute of the Ministry of the Interior (Director Choi Doo-young, hereinafter referred to as the Training Institute) will operate the "Mongolian NAOG* Capacity Building Program" from March 1st to March 8th. \n○ 14 people including professors, senior officials, and training-related officials will participate in this program. 
\n* NAOG (National Academy of Governance): The largest educational institution in Mongolia that educates opinion leaders in Mongolia, including civil servants, politicians, and civilians, and awards master's and doctoral degrees. \n□ After signing a MOU on exchange and cooperation with the Mongolian NAOG in 2002, the Training Institute has operated 13 training programs (such as administrative reform, economic development strategies, and measures to improve administrative transparency), graduating 158 NAOG professors and senior officials as alumni. \n○ In addition, the Training Institute has been operating various training programs such as the Mongolian Governor Training Program, which allows 1,310 local government officials such as governors and county heads in Mongolia to benchmark the excellent cases of Korean local administration."""
messages = [
{
"role": "system",
"content": """As an interactive AI, your main role is to provide reliable information in response to users' questions. You need to accurately understand users' requirements and analyze relevant documents to generate the best answers. \nYou must follow the following principles:\n1. Always prioritize users' requests and provide clear and easy-to-understand answers. \n2. Make the most of the provided documents to construct responses, but improve the quality of responses through additional analysis and logic. \n3. When generating responses, you must follow the given instructions and provide clear sources. \n4. If users' questions are ambiguous, you can consider rephrasing the questions to ensure clarity. \n\n# User Guide\n## Tasks and Context\nYou need to analyze relevant documents in response to users' questions and generate responses based on reliable information. It is important to provide information in the most appropriate form considering the context, rather than simply conveying information. \n\n## Style Guide\nPlease output your answers in JSON format. [{"related_document": {"doc_id found in document information"}, "source": {"doc_id found in document information": "Quoted sentences found in the corresponding document, presented as the original text", "doc_id found in document information": "Quoted sentences found in the corresponding document, presented as the original text"}, "answer": "A descriptive answer of 3 - 6 sentences without indicating the source", "grounded_answer": "The same as the answer, but with the citation source indicated by the <co: doc_id> and </co: doc_id> symbols"}]\n""",
},
{"role": "user", "content": prompt},
]
# Build the chat prompt from the system and user messages
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the answer and strip the prompt tokens from the output
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
✨ Features
This model is fine-tuned from Qwen2.5 14B on self-built RAG-specific datasets, CoT datasets, and benchmark datasets. Given the input of a typical RAG service, it generates accurate answers together with their sources and returns the result in JSON format. The output keys are as follows:
- "related_document": the doc_id and title of each document related to the question (key: doc_id, value: document title).
- "source": for each related document, the sentences quoted in the answer (key: doc_id, value: quoted sentences from the original document).
- "answer": a descriptive answer of 3-6 sentences without source indications.
- "grounded_answer": the same answer, with citation sources marked using the <co: doc_id> and </co: doc_id> tags.
Example of answer output:
```json
{
"related_document": {
"D0000042284685": "Measures to promote fire prevention for electric tricycles in Garak Market",
"4895": "Next-generation high-reliability and high-output supercapacitors"
},
"source": {
"D0000042284685": "「Charging devices for logistics and transportation equipment (lithium-ion batteries) ...",
"4895": "Comparison between supercapacitors and lithium secondary batteries ..."
},
"answer": "The lithium-ion batteries and supercapacitors of electric tricycles in Garak Market have differences in...",
"grounded_answer": "The lithium-ion batteries and supercapacitors of electric tricycles in Garak Market have differences in <co: 4895>mechanism, materials, lifespan, protection circuits, polarity, overvoltage, residual capacity measurement, characteristics</co: 4895>, etc. Lithium-ion batteries have a <co: 4895>lithium-ion movement mechanism</co: 4895>..."
}
```
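Since the model returns its answer as a JSON string, downstream code will typically parse it and, if needed, collect the documents cited in grounded_answer. The snippet below is a minimal post-processing sketch using the `response` string from the Quick Start; the helper name is hypothetical, and the regular expression for the <co: doc_id> tags is inferred from the example above rather than specified by the card.

```python
import json
import re

def parse_rag_response(response: str):
    """Parse the model's JSON output and collect the doc_ids cited in grounded_answer."""
    data = json.loads(response)
    # The style guide wraps the object in a list; unwrap it if so.
    if isinstance(data, list):
        data = data[0]
    # Opening citation tags look like "<co: 4895>" in the example above (assumed format).
    cited_ids = set(re.findall(r"<co:\s*([^>]+)>", data.get("grounded_answer", "")))
    return data["answer"], data.get("source", {}), cited_ids

answer, sources, cited_ids = parse_rag_response(response)
print(answer)
print("Cited documents:", cited_ids)
```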
📚 Documentation
RAG Prompt
RAG_PROMPT = """<|im_start|>system\n\n As an interactive AI, your main role is to provide reliable information in response to users' questions. You need to accurately understand users' requirements and analyze relevant documents to generate the best answers. \nYou must follow the following principles:\n1. Always prioritize users' requests and provide clear and easy-to-understand answers. \n2. Make the most of the provided documents to construct responses, but improve the quality of responses through additional analysis and logic. \n3. When generating responses, you must follow the given instructions and provide clear sources. \n4. If users' questions are ambiguous, you can consider rephrasing the questions to ensure clarity. \n\n# User Guide\n## Tasks and Context\nYou need to analyze relevant documents in response to users' questions and generate responses based on reliable information. It is important to provide information in the most appropriate form considering the context, rather than simply conveying information. \n\n## Style Guide\nPlease output your answers in JSON format. [{"related_document": {"doc_id found in document information"}, "source": {"doc_id found in document information": "Quoted sentences found in the corresponding document, presented as the original text", "doc_id found in document information": "Quoted sentences found in the corresponding document, presented as the original text"}, "answer": "A descriptive answer of 3 - 6 sentences without indicating the source", "grounded_answer": "The same as the answer, but with the citation source indicated by the <co: doc_id> and </co: doc_id> symbols"}]\n<|im_end|>\n<|im_start|>user\n {instruction} <|im_end|>\n<|im_start|>assistant\n"""
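The {instruction} placeholder holds the user question followed by the retrieved documents. The card does not pin down that serialization beyond the Quick Start example, so the helper below is a sketch that mirrors the `doc_id : ... / title : ... / content : ...` layout shown there; the function name and the keys of the input dicts are assumptions, and the document content is truncated for brevity. Note that `str.format` would trip over the JSON braces inside the template, so plain string replacement is used instead. `tokenizer` and `model` are assumed to be loaded as in the Quick Start.

```python
def build_instruction(question: str, documents: list[dict]) -> str:
    """Serialize a question plus retrieved documents in the layout used by the Quick Start prompt."""
    doc_blocks = [
        f"doc_id : {doc['doc_id']} / title : {doc['title']} / content : {doc['content']}"
        for doc in documents
    ]
    return f"Question: {question}\n\n " + " ".join(doc_blocks)

question = "Apart from NAOG, which Mongolian people have participated in the training programs?"
retrieved_docs = [
    {
        "doc_id": 45112,
        "title": "We came to learn about the innovation cases of Korean public enterprises!",
        "content": "A delegation of professors and senior officials from the National Academy of Governance (NAOG) of Mongolia visited Korea ...",
    },
]

# Fill the template with str.replace (str.format would choke on the JSON braces in the style guide).
rag_input = RAG_PROMPT.replace("{instruction}", build_instruction(question, retrieved_docs))
model_inputs = tokenizer(rag_input, return_tensors="pt").to(model.device)
```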
Training Environment and Parameters
- Tuning environment:
- H100 (80GB) * 8
- Parameters (restated as a TrainingArguments sketch after this list):
- tokenizer_model_max_length: 4500
- use_flash_attn: True
- num_train_epochs: 3.0
- weight_decay: 0.001
- lr_scheduler_type: "linear"
- per_device_train_batch_size: 1
- gradient_accumulation_steps: 64
- learning_rate: 5e-06
- bf16: True
- deepspeed: ds_stage2.json
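The training code itself is not published. As a rough illustration only, the hyperparameters above map onto Hugging Face `TrainingArguments` as sketched below; the output directory is a placeholder, and `tokenizer_model_max_length` / `use_flash_attn` are tokenizer- and model-loading settings rather than `TrainingArguments` fields, so they are omitted.

```python
from transformers import TrainingArguments

# Hypothetical restatement of the listed hyperparameters; not the authors' actual script.
training_args = TrainingArguments(
    output_dir="rag-specialized-llm-ft",  # placeholder, not specified in the card
    num_train_epochs=3.0,
    weight_decay=0.001,
    lr_scheduler_type="linear",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    learning_rate=5e-6,
    bf16=True,
    deepspeed="ds_stage2.json",
)
```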
Datasets Used
- AIhub 16 administrative document-based machine reading comprehension data
- AIhub 17 news article machine reading comprehension data
- AIhub 21 book material machine reading comprehension
- AIhub 149 table information Q&A data
- AIhub 150 numerical operation machine reading comprehension data
- AIhub 151 financial and legal document machine reading comprehension data
- kyujinpy/KoCoT_2000
- MarkrAI/KoCommercial-Dataset
- CarrotAI/ko-instruction-dataset
- heegyu/CoT-collection-ko
Contact Us
- Surromind
- 2nd Floor, 1802 Nambusunhwan-ro, Gwanak-gu, Seoul
- 02-872-5127
- contact@surromind.ai
🔧 Technical Details
The model is based on Qwen/Qwen2.5-14B and is fully fine-tuned on self-built RAG-specific datasets, CoT datasets, and benchmark datasets, using the training environment and hyperparameters listed above, so that it can generate accurate answers with their sources for general RAG inputs and return them in JSON format.
📄 License
The model is licensed under the Apache-2.0 license.
| Property | Details |
|---|---|
| Model Type | RAG-Specialized-LLM |
| Base Model | Qwen/Qwen2.5-14B |
| Tags | RAG, Ko-LLM, QA |
| Datasets | kyujinpy/KoCoT_2000, MarkrAI/KoCommercial-Dataset, CarrotAI/ko-instruction-dataset, heegyu/CoT-collection-ko, AIhub 16 administrative document-based machine reading comprehension data, AIhub 17 news article machine reading comprehension data, AIhub 21 book material machine reading comprehension, AIhub 149 table information Q&A data, AIhub 150 numerical operation machine reading comprehension data, AIhub 151 financial and legal document machine reading comprehension data |
| Pipeline Tag | text-generation |
| License | Apache-2.0 |

