DeepSeek-R1-Qwen2.5-1.5b-SFT-R1 Open-Source Model: Free Conversion of Unstructured Text to Structured JSON

Deepseek R1 Qwen2.5 1.5b SFT R1 JSON Unstructured To Structured

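Developed by MasterControlAIML

This model is based on the Qwen2 architecture and is designed to convert unstructured text into structured output that follows a predefined JSON schema.

Large Language Model

Transformers

English · License: Apache-2.0 · #text-structuring #JSON-schema-mapping #document-parsing

Downloads: 399

Released: 2/7/2025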

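Model Overview

The model is fine-tuned to map unstructured text (such as manuals and Q&A documents) into structured JSON, and it supports hierarchical data extraction driven by a given JSON schema.

Model Features

Efficient structured conversion

Maps unstructured text accurately into a predefined JSON schema

Hierarchical data extraction

Supports extraction into complex JSON schemas with hierarchical relationships

Fast inference

Uses the Unsloth library for efficient model inference

Model Capabilities

Unstructured text parsing

JSON schema matching

Hierarchical data extraction

Structured output generation

Use Cases

Document processing

Quality manual structuring

Converts an unstructured quality assurance manual into JSON that conforms to a specific schema

Produces a standardized document structure that is easier to process and analyze downstream

Data standardization

Q&A data conversion

Converts unstructured question-and-answer data into structured JSON

Enables standardized storage and retrieval of Q&A data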

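🚀 MasterControlAIML R1-Qwen2.5-1.5b SFT R1 JSON Unstructured-to-Structured Model

This project provides a fine-tuned Qwen2 model optimized for converting unstructured text into structured JSON output that conforms to a predefined schema. The model is fine-tuned from the base model MasterControlAIML/DeepSeek-R1-Strategy-Qwen-2.5-1.5b-Unstructured-To-Structured and uses LoRA for efficient adaptation.

Key highlights:

Developed by: bhaviktheslider

License: Apache-2.0

Fine-tuned from: MasterControlAIML/DeepSeek-R1-Strategy-Qwen-2.5-1.5b-Unstructured-To-Structured

Accelerated training: trained about 2x faster with Unsloth and Hugging Face's TRL library.

🚀 Quick Start

This model is intended for tasks that map unstructured text (such as manuals and Q&A documents) into structured JSON. It supports hierarchical data extraction driven by a given JSON schema, and it is trained to produce output that strictly follows the structure and rules the schema defines.

Fast inference with Unsloth

The Unsloth library lets you load the model and run inference quickly. A basic example: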

from unsloth import FastLanguageModel
import torch

MODEL = "MasterControlAIML/DeepSeek-R1-Qwen2.5-1.5b-SFT-R1-JSON-Unstructured-To-Structured"

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,
)

# Switch the model to inference mode
FastLanguageModel.for_inference(model)

ALPACA_PROMPT = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}
"""

# Example instruction and prompt
instruction = ""  # fill in your instruction (see the examples below)
prompt = ALPACA_PROMPT.format(instruction, "")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=2000)

# Print the generated text
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
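
For a decoder-only model, the sequence returned by `generate` contains the prompt tokens followed by the new tokens, so the decoded string above repeats the whole prompt before the model's answer. A minimal sketch of trimming that echo, assuming the Alpaca template shown above; `extract_response` and the sample string are hypothetical helpers for illustration, not part of the model card:

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the first response marker.

    If the marker is absent, the input is returned unchanged (stripped).
    """
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()


# Hypothetical decoded output: the echoed prompt followed by the generation.
decoded = (
    "Below is an instruction that describes a task...\n"
    "### Instruction:\nExtract the data.\n"
    "### Response:\n"
    "<think>Direct mapping from text to schema.</think>\n"
    "<answer>{}</answer>"
)

print(extract_response(decoded))
```

The first marker is used because the prompt always precedes the generation. If your instruction text itself contains `### Response:`, this simple split would cut too early and a different delimiter would be needed.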

Inference with Transformers

You can also use Hugging Face's Transformers library directly:

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

MODEL = "MasterControlAIML/DeepSeek-R1-Qwen2.5-1.5b-SFT-R1-JSON-Unstructured-To-Structured"

# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

ALPACA_PROMPT = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}
"""

TEXT = ""  # fill in your instruction text (see the examples below)
prompt = ALPACA_PROMPT.format(TEXT, "")
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)

with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=2000,
        do_sample=True,  # explicit, so temperature/top_p apply regardless of the default generation config
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        streamer=text_streamer,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

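✨ Key Features

Efficient inference: uses the Unsloth library for fast model inference.
Structured output: maps text input into a strict JSON schema with hierarchical relationships.
Flexible integration: the example snippets show how to use both the Unsloth API and Hugging Face Transformers.
Advanced prompting: detailed examples using the Alpaca prompt template and a LangChain prompt template are provided below.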

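📦 Installation

Prerequisites

Python: 3.8+
PyTorch: a CUDA-enabled build is recommended
Required libraries: transformers, torch, unsloth, langchain (for advanced usage)

Install command

Install the required Python packages with: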

pip install torch transformers unsloth langchain

💻 Usage Examples

Advanced data extraction example (with LangChain)

from langchain_core.prompts import PromptTemplate

# Sample text input with a slightly different structure
TEXT1 = """
Quality Assurance Manual Overview

This document provides an introduction to the quality assurance procedures for manufacturing processes.

## Introduction to Manufacturing Processes

This section covers the basics of manufacturing processes. It includes definitions and a brief history.

| Name                | Description               |
|---------------------|---------------------------|
| Process 1           | Initial process in manufacturing. |

### Detailed Process 1

This process refines material to meet specific standards.

| Parameter | Range    |
|-----------|----------|
| Temperature | 200-300°C |

**Operational Parameters**

The operational parameters are critical for the success of this process.

| Setting | Class      |
|---------|------------|
| Critical | High Precision |
"""

# Example of a different JSON schema
SCHEMA1 = """
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "id": {
      "type": "string",
      "description": "Dot-separated integers representing the hierarchical id of the element."
    },
    "title": {
      "type": "string",
      "description": "Descriptive title of the section or component."
    },
    "level": {
      "type": "integer",
      "description": "Hierarchy level starting from 0 for the root."
    },
    "level_type": {
      "type": "string",
      "enum": ["ROOT", "SECTION", "SUBSECTION", "DETAIL_N"],
      "description": "Type of the hierarchal level."
    },
    "component": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "idc": {
            "type": "integer",
            "description": "Sequential unique component ID."
          },
          "component_type": {
            "type": "string",
            "enum": ["PARAGRAPH", "TABLE", "CALCULATION", "CHECKBOX"],
            "description": "Type of the component."
          },
          "metadata": {
            "type": "string",
            "description": "Additional metadata token (may be <title>, <note>, or <overview>)."
          },
          "properties": {
            "type": "object",
            "properties": {
              "variables": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "idx": {
                      "type": "string",
                      "description": "Unique identifier, X.Y (X represents row, Y represents column for Tables, 0 otherwise)."
                    },
                    "name": {
                      "type": "string",
                      "description": "Attribute name."
                    },
                    "value": {
                      "type": "string",
                      "description": "Attribute value."
                    },
                    "unit": {
                      "type": "string",
                      "description": "Optional unit."
                    },
                    "metrics": {
                      "type": "boolean",
                      "description": "Flag indicating if variable is a metric."
                    },
                    "formula": {
                      "type": "boolean",
                      "description": "Flag indicating if variable has an equation."
                    }
                  }
                }
              },
              "content": {
                "type": "array",
                "items": {
                  "type": "string"
                },
                "description": "Text content (e.g., notes, MCQs, questions, points, etc.)."
              }
            }
          }
        }
      }
    },
    "children": {
      "type": "array",
      "items": {
        "$ref": "#"
      },
      "description": "Nested elements."
    }
  },
  "required": ["id", "title", "level", "level_type", "component", "children"]
}
"""

# LangChain prompt that guides the model
SYSTEM_PROMPT = """
### Role:
You are an expert data extractor specializing in mapping hierarchical text data into a given JSON Schema.

### DATA INPUT:
- **Text:** ```{TEXT}```  
- **Blank JSON Schema:** ```{SCHEMA}```  

### TASK REQUIREMENT:
1. Analyze the given text and map all relevant information strictly into the provided JSON Schema.
2. Provide your output in **two mandatory sections**:
   - **`<answer>`:** The filled JSON object  
   - **`<think>`:** Reasoning for the mapping decisions  

### OUTPUT STRUCTURE:

`<think> /* Explanation of mapping logic */ </think>`
`<answer> /* Completed JSON Object */ </answer>`



### STRICT RULES FOR GENERATING OUTPUT:
1. **Both Tags Required:**  
   - Always provide both the `<think>` and the `<answer>` sections.  
   - If reasoning is minimal, state: "Direct mapping from text to schema."
2. **JSON Schema Mapping:**  
   - Strictly map the text data to the given JSON Schema without modification or omissions.
3. **Hierarchy Preservation:**  
   - Maintain proper parent-child relationships and follow the schema's hierarchical structure.
4. **Correct Mapping of Attributes:**  
   - Map key attributes, including `id`, `idc`, `idx`, `level_type`, and `component_type`.
5. **JSON Format Compliance:**  
   - Escape quotes (`\"`), replace newlines with `\\n`, avoid trailing commas, and use double quotes exclusively.
6. **Step-by-Step Reasoning:**  
   - Explain your reasoning within the `<think>` tag.

### IMPORTANT:
If either the `<think>` or `<answer>` tags is missing, the response will be considered incomplete.
"""

system_prompt_template = PromptTemplate(template=SYSTEM_PROMPT, input_variables=["TEXT", "SCHEMA"])

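# Format the prompt with the chosen text and JSON schema
system_prompt_str = system_prompt_template.format(TEXT=TEXT1, SCHEMA=SCHEMA1)

# Now pass `system_prompt_str` into your inference pipeline (Unsloth, Transformers, etc.)
# to obtain output containing both the reasoning and the mapped JSON object.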
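
One plausible way to connect this prompt to the inference snippets is to use it as the instruction in the Alpaca template. The source does not spell this step out, so treat the sketch below as an assumption; the short `system_prompt_str` here is a hypothetical stand-in for the real formatted prompt so that the example runs on its own:

```python
ALPACA_PROMPT = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}
"""

# Hypothetical stand-in for the output of system_prompt_template.format(...).
# Note that it contains literal JSON braces, as the real prompt does.
system_prompt_str = (
    "### Role:\n"
    "You are an expert data extractor.\n"
    '- **Blank JSON Schema:** ```{"type": "object"}```'
)

# str.format only interprets braces in the template itself, so braces inside
# the argument are inserted verbatim and need no escaping.
prompt = ALPACA_PROMPT.format(system_prompt_str, "")
print(prompt)
```

The resulting `prompt` can then be tokenized and passed to `model.generate` in the same way as in the Unsloth or Transformers snippets.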

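📚 Detailed Documentation

Example responses with chain-of-thought explanations

Below are two sample outputs, each containing a chain-of-thought explanation (<think>) and the corresponding filled JSON schema (<answer>). They illustrate how the model is expected to explain its reasoning and produce the final mapped schema.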
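
Since every response is expected to carry these two tagged sections, downstream code usually needs to separate them and parse the JSON. The following is a minimal sketch using only the standard library; `last_tag` and `parse_response` are hypothetical helper names, and the sample string is made up for illustration. The last occurrence of each tag is used because the prompt itself contains placeholder `<think>` and `<answer>` tags, which appear first if the decoded text still includes the prompt:

```python
import json
import re


def last_tag(text, name):
    """Return the stripped content of the last <name>...</name> pair, or None."""
    matches = re.findall(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def parse_response(text):
    """Split a response into (reasoning, parsed_json)."""
    reasoning = last_tag(text, "think")
    raw_json = last_tag(text, "answer")
    if raw_json is None:
        raise ValueError("response has no <answer> section")
    # json.loads raises json.JSONDecodeError if the model emitted invalid JSON.
    return reasoning, json.loads(raw_json)


# Made-up response in the documented two-section format.
sample = (
    "<think>Direct mapping from text to schema.</think>\n"
    '<answer>{"id": "0.0", "title": "Quality Assurance Manual", "children": []}</answer>'
)

reasoning, data = parse_response(sample)
print(reasoning)
print(data["title"])
```

A small model can produce truncated or malformed JSON, so callers may want to catch `json.JSONDecodeError` and retry or log the raw text.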

Example Response 1

<think>
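Chain-of-thought explanation:
To fill in the blank schema from the text so that it exactly matches the filled schema, I will start by understanding what each section represents.

First, looking at the blank schema, I see that it has placeholders that need to be filled with actual content. The goal is to replace these placeholders with the appropriate data extracted from the text so that the completed schema is identical to the filled schema.

I notice that both schemas share a similar structure and differ only in their component types. The first step is to determine how each component type maps back to elements in the text.

For example, in the "Introduction" section, under "Key Metrics", there is a table listing two variables: a defect rate of 0.5% and a yield rate of 98%. In the filled schema, these become entries in the "variables" array of a table component. So I will extract these values from the corresponding part of the text and insert them at the matching positions in the blank schema.

Next, in the subsection titled "Quality Assurance Checklist", there are checkboxes for confirming whether defects were identified and corrected and whether audits are conducted regularly. Looking at the text again, I see bullet points describing whether specific checks were performed, such as confirming that defects were found and then noting what was not confirmed. From this, I can infer that each checkbox corresponds to one confirmation statement.

Then there is the detail section titled "Standards Overview". It contains another table that compares ISO 9001 with the ASQ-certified quality management system. In the text, under "Comparison of Quality Assurance Standards", both options are mentioned as checked, with explicit true/false statuses. So I will replicate this structure in the blank schema by adding rows with the same comparison details.

Throughout the process, my focus is on matching each piece of text to the correct position in the schema without changing anything beyond what is specified. By systematically identifying and filling each placeholder according to the text, the final result should be identical to the filled schema, because every element in the filled version has been accounted for during extraction.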
</think>

<answer>{
    "id": "0.0",
    "title": "Quality Assurance Manual for Manufacturing Process",
    "level": 0,
    "level_type": "ROOT",
    "component": [
        {
            "idc": 0,
            "component_type": "PARAGRAPH",
            "metadata": "<overview>This manual covers the QA procedures and guidelines for various manufacturing processes.</overview>",
            "properties": {
                "variables": [],
                "content": []
            }
        }
    ],
    "children": [
        {
            "id": "1.0",
            "title": "Introduction to Quality Assurance",
            "level": 1,
            "level_type": "SECTION",
            "component": [
                {
                    "idc": 0,
                    "component_type": "PARAGRAPH",
                    "metadata": "",
                    "properties": {
                        "variables": [],
                        "content": [
                            "Quality Assurance (QA) involves systematic activities designed to ensure consistent production of reliable and safe products.",
                            "It aims to reduce variability in the final product and improve overall efficiency."
                        ]
                    }
                },
                {
                    "idc": 1,
                    "component_type": "TABLE",
                    "metadata": "",
                    "properties": {
                        "variables": [
                            {
                                "idx": "0.0",
                                "name": "Metric",
                                "value": "Defect Rate",
                                "unit": "%",
                                "metrics": true,
                                "formula": false
                            },
                            {
                                "idx": "0.1",
                                "name": "Target",
                                "value": 0.5,
                                "unit": null,
                                "metrics": true,
                                "formula": false
                            }
                        ],
                        "content": []
                    }
                }
            ],
            "children": [
                {
                    "id": "1.1",
                    "title": "QA in Manufacturing Processes",
                    "level": 2,
                    "level_type": "SUBSECTION",
                    "component": [
                        {
                            "idc": 0,
                            "component_type": "PARAGRAPH",
                            "metadata": "",
                            "properties": {
                                "variables": [],
                                "content": [
                                    "Manufacturing processes require strict adherence to QA procedures to ensure product reliability and safety."
                                ]
                            }
                        },
                        {
                            "idc": 1,
                            "component_type": "CHECKBOX",
                            "metadata": "",
                            "properties": {
                                "variables": [
                                    {
                                        "idx": "0.0",
                                        "name": "Process Control",
                                        "value": true,
                                        "unit": null,
                                        "metrics": false,
                                        "formula": false
                                    },
                                    {
                                        "idx": "0.1",
                                        "name": "Raw Material Inspection",
                                        "value": false,
                                        "unit": null,
                                        "metrics": false,
                                        "formula": false
                                    }
                                ],
                                "content": [
                                    "Check if the following QA procedures are followed:",
                                    "\u2713 Process Control \u2713 Raw Material Inspection"
                                ]
                            }
                        }
                    ],
                    "children": [
                        {
                            "id": "1.1.1",
                            "title": "Standard QA Practices",
                            "level": 3,
                            "level_type": "DETAIL_N",
                            "component": [
                                {
                                    "idc": 0,
                                    "component_type": "PARAGRAPH",
                                    "metadata": "",
                                    "properties": {
                                        "variables": [],
                                        "content": [
                                            "Adherence to ISO 9001 standards helps maintain quality control."
                                        ]
                                    }
                                },
                                {
                                    "idc": 1,
                                    "component_type": "TABLE",
                                    "metadata": "",
                                    "properties": {
                                        "variables": [
                                            {
                                                "idx": "0.0",
                                                "name": "Comparison",
                                                "value": "ISO 9001 vs ASQ-Certified QM",
                                                "unit": null,
                                                "metrics": false,
                                                "formula": false
                                            }
                                        ],
                                        "content": []
                                    }
                                }
                            ],
                            "children": []
                        }
                    ]
                }
            ]
        }
    ]
}</answer>

Example Response 2

<think>
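Chain-of-thought explanation:
To fill in the blank schema from the provided text and make sure it exactly matches the filled schema, we can proceed as follows:

1. **Identify the components**: Recognize that both schemas contain similar structures with different component types (tables, paragraphs, checkboxes).
2. **Extract data from the text**:
   - Locate the table in the introduction section that contains variables such as defect rate and yield rate.
   - Identify the confirmation statements about defect identification and regular audits in the subsection.
3. **Fill in the variables**: Populate the table component under "Key Metrics" with the extracted variable names and values (defect rate = 0.5%, yield rate = 98%).
4. **Fill in the checkboxes**: Convert the bullet points into boolean values for the checkboxes; set each confirmation to reflect the findings (defects found is true, audits conducted is false).
5. **Add the standards table**: Create a second table in the detail section comparing ISO 9001 with the ASQ-certified quality management system, consistent with the source text.

By aligning each extracted detail precisely with its designated position in the blank schema, and making sure no data is added or omitted, the resulting schema will match the filled schema in both structure and content.

**Cross-validation**: Each step corresponds directly to filling a specific field, with no deviation. Following this structured approach therefore ensures that the transformed schema produces the same result as the target filled schema.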
</think>

<answer>{
    "id": "0.0",
    "title": "Quality Assurance Manual",
    "level": 0,
    "level_type": "ROOT",
    "component": [
        {
            "idc": 0
        },
        {
            "idc": 1
        }
    ],
    "children": [
        {
            "id": "1.0",
            "title": "Introduction to Manufacturing Processes",
            "level": 1,
            "level_type": "SECTION",
            "component": [],
            "children": [
                {
                    "id": "1.1",
                    "title": "Detailed Process 1",
                    "level": 2,
                    "level_type": "SUBSECTION",
                    "component": [],
                    "children": []
                },
                {
                    "id": "1.2",
                    "title": "Operational Details",
                    "level": 2,
                    "level_type": "SUBSECTION",
                    "component": [
                        {
                            "idc": 0,
                            "component_type": "PARAGRAPH",
                            "metadata": "<overview>This section outlines the operational aspects of Process 1.</overview>",
                            "properties": {}
                        }
                    ],
                    "children": []
                }
            ]
        },
        {
            "id": "2.0",
            "title": "Advanced Manufacturing Techniques",
            "level": 1,
            "level_type": "SECTION",
            "component": [],
            "children": [
                {
                    "id": "2.1",
                    "title": "High-Temperature Processing",
                    "level": 2,
                    "level_type": "SUBSECTION",
                    "component": [
                        {
                            "idc": 0,
                            "component_type": "TABLE",
                            "metadata": "<title>Table of Specifications</title>",
                            "properties": {
                                "variables": [
                                    {
                                        "idx": "0.0",
                                        "name": "Parameter",
                                        "value": "Temperature",
                                        "unit": "\u00b0C",
                                        "metrics": true,
                                        "formula": false
                                    },
                                    {
                                        "idx": "1.0",
                                        "name": "Range",
                                        "value": "[200, 300]",
                                        "unit": null,
                                        "metrics": false,
                                        "formula": false
                                    }
                                ],
                                "content": [
                                    "Note: Ensure temperature control within specified range."
                                ]
                            }
                        }
                    ],
                    "children": []
                }
            ]
        }
    ]
}</answer>
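
Outputs like the ones above can be sanity-checked before they are stored. The sketch below is a deliberately small structural check written with the standard library only, not a full JSON Schema validator: it walks the `children` tree and verifies the node-level `required` keys and the `level_type` enum from the schema shown earlier. The function name `check_node` and both sample trees are hypothetical.

```python
REQUIRED_KEYS = ("id", "title", "level", "level_type", "component", "children")
LEVEL_TYPES = {"ROOT", "SECTION", "SUBSECTION", "DETAIL_N"}


def check_node(node, path="root"):
    """Recursively collect human-readable problems for a node and its children."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in node:
            problems.append(f"{path}: missing '{key}'")
    if node.get("level_type") not in LEVEL_TYPES:
        problems.append(f"{path}: unexpected level_type {node.get('level_type')!r}")
    for i, child in enumerate(node.get("children", [])):
        problems.extend(check_node(child, f"{path}.children[{i}]"))
    return problems


# Made-up trees: one well-formed, one with a broken child node.
good = {
    "id": "0.0", "title": "Quality Assurance Manual", "level": 0,
    "level_type": "ROOT", "component": [],
    "children": [{
        "id": "1.0", "title": "Introduction to Manufacturing Processes",
        "level": 1, "level_type": "SECTION", "component": [], "children": [],
    }],
}
bad = {
    "id": "0.0", "title": "Quality Assurance Manual", "level": 0,
    "level_type": "ROOT", "component": [],
    "children": [{"id": "1.0", "title": "Intro", "level": 1,
                  "level_type": "CHAPTER", "children": []}],
}

print(check_node(good))
print(check_node(bad))
```

This check covers only the top-level node fields. Value types inside `component` (for example, whether `value` is a string as the schema requests) are not inspected, so a complete JSON Schema validator would be the more thorough option.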