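diagram2graph-adapters: an open-source vision-language model that extracts structured data from images and turns it into knowledge graphs, free of charge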

Diagram2graph Adapters

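Developed by zackriya

A vision-language model focused on extracting structured data (JSON) from images. It is particularly good at identifying the nodes, edges, and their sub-attributes in diagrams, and it represents the visual information as a knowledge graph.

Image-to-Text

Safetensors

License: Apache-2.0 #DiagramToJSON #KnowledgeGraphConstruction #VisualStructuredExtraction

Downloads: 52

Release date: 3/14/2025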

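Model Overview

The model is fine-tuned from Qwen2.5-VL-3B-Instruct and focuses on extracting structured data from visual representations of process and flow diagrams, returning the result as JSON.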

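Model Features

Structured data extraction

Extracts nodes, edges, and their attributes from diagram images and returns them as structured JSON

LoRA fine-tuning

Fine-tuned with LoRA-based optimization to improve performance over the base model

Knowledge graph representation

Converts visual information into knowledge graph form, which simplifies downstream analysis and processing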

Model Capabilities

Diagram image analysis

Structured data extraction

JSON output

Knowledge graph construction

Use Cases

Diagram analysis

Flowchart parsing

Extracts structured node and edge information from flowcharts

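Average node-detection F1 rose by about 14 points (0.749 to 0.891) and average edge-detection F1 by about 23 points (0.4605 to 0.6945) relative to the base model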

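BPMN analysis

Supports automated processing and analysis of BPMN diagrams

Document processing

Automated document processing

Extracts structured data from diagrams embedded in documents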

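🚀 🖼️🔗 Diagram to Knowledge Graph Model

This model is a research-driven project developed during an internship at Zackariya Solution. It focuses on extracting structured data (JSON) from images, specifically nodes, edges, and their sub-attributes, and represents the visual information as a knowledge graph.

🚀 Note: This model is intended for learning purposes only, not for production use. The extracted structured data may vary depending on project requirements.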

🚀 Quick Start

Install dependencies

%pip install -q "transformers>=4.49.0" accelerate datasets "qwen-vl-utils[decord]==0.0.8"

Run the inference code

import torch
from qwen_vl_utils import process_vision_info
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor


MODEL_ID="zackriya/diagram2graph-adapters"
MAX_PIXELS = 1280 * 28 * 28
MIN_PIXELS = 256 * 28 * 28


model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

processor = Qwen2_5_VLProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=MIN_PIXELS,
    max_pixels=MAX_PIXELS
)


SYSTEM_MESSAGE = """You are a Vision Language Model specialized in extracting structured data from visual representations of process and flow diagrams.
Your task is to analyze the provided image of a diagram and extract the relevant information into a well-structured JSON format.
The diagram includes details such as nodes and edges. each of them have their own attributes.
Focus on identifying key data fields and ensuring the output adheres to the requested JSON structure.
Provide only the JSON output based on the extracted information. Avoid additional explanations or comments."""

def run_inference(image):
    messages= [
        {
            "role": "system",
            "content": [{"type": "text", "text": SYSTEM_MESSAGE}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    # qwen_vl_utils.process_vision_info handles this value, so a PIL image, a URL, or a file path all work
                    "image": image,
                },
                {
                    "type": "text",
                    "text": "Extract data in JSON format, Only give the JSON",
                },
            ],
        },
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)

    inputs = processor(
        text=[text],
        images=image_inputs,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids
        in zip(inputs.input_ids, generated_ids)
    ]

    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
    return output_text
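
# `eval_dataset` is not defined in this snippet; it stands for any dataset with an "image" column.
# `image` may also be a URL or a relative path to an image file.
image = eval_dataset[9]['image']  # PIL image
output = run_inference(image)

# Parse the returned string as JSON
import json
json.loads(output[0])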
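
Calling `json.loads` directly on the raw output fails if the model wraps its answer in a Markdown code fence or adds stray text around the JSON. The sketch below shows one way to make parsing more tolerant. `parse_model_json` is a hypothetical helper introduced here for illustration and is not part of the model or of any library.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_model_json(raw: str):
    """Parse model output as JSON, tolerating a code fence or surrounding prose.

    Hypothetical helper for illustration; not part of the model's API.
    """
    text = raw.strip()
    # Drop an opening fence (optionally tagged "json") and a closing fence.
    text = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span when prose surrounds the JSON.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(text[start:end + 1])
```

With the inference code above, this would be called as `parse_model_json(output[0])`. Output truncated by `max_new_tokens` still raises `json.JSONDecodeError`, which is usually the desired signal to retry with a larger token budget.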

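✨ Key Features

Focuses on extracting structured data (JSON) from images and representing the visual information as a knowledge graph.
Suitable for experimenting with diagram-to-knowledge-graph conversion and for understanding AI-driven structured extraction from images.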
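
To make the knowledge graph idea concrete, here is a minimal sketch that turns an extracted JSON object into an adjacency mapping. The model card does not show the output schema, so the field names used here (`nodes`, `edges`, `id`, `source`, `target`, `label`) and the sample data are assumptions chosen for illustration only.

```python
# Hypothetical extraction result. The schema is an assumption, since the
# model card does not document the exact JSON structure.
extracted = {
    "nodes": [
        {"id": "1", "label": "Start"},
        {"id": "2", "label": "Review request"},
        {"id": "3", "label": "End"},
    ],
    "edges": [
        {"source": "1", "target": "2"},
        {"source": "2", "target": "3", "label": "approved"},
    ],
}


def build_adjacency(graph: dict) -> dict:
    """Map each node id to a list of (target id, edge label) pairs leaving it."""
    # Start from the node list so that isolated nodes still appear.
    adjacency = {node["id"]: [] for node in graph.get("nodes", [])}
    for edge in graph.get("edges", []):
        # setdefault covers edges whose source is missing from the node list.
        adjacency.setdefault(edge["source"], []).append(
            (edge["target"], edge.get("label"))
        )
    return adjacency


adjacency = build_adjacency(extracted)
```

A plain dictionary keeps the example free of dependencies. The same loop could feed a graph library if traversal or visualization is needed.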

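📋 Model Details

Attribute	Details
Developed by	Zackariya Solution internship team (Mohammed Safvan)
Fine-tuned from	`Qwen/Qwen2.5-VL-3B-Instruct`
License	Apache 2.0
Language	Multilingual (focused on structured extraction)
Model type	Vision-language Transformer (PEFT fine-tuned)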

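🎯 Use Cases

✅ Direct use

Experimenting with diagram-to-knowledge-graph conversion 📊
Understanding AI-driven structured extraction from images

🚀 Downstream use (potential)

Enhancing BPMN and flowchart analysis 🏗️
Supporting automated document processing 📄

❌ Out-of-scope use

Not suitable for real production deployments ⚠️
May not generalize well to all diagram types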

🏗️ Training Details

Dataset: internally curated diagram dataset 🖼️
Fine-tuning method: LoRA-based optimization ⚡
Precision: bf16 mixed-precision training 🎯
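
As background on why LoRA is a common choice for fine-tuning, the sketch below compares trainable parameter counts for a single weight matrix. LoRA keeps the original matrix W (shape d_out x d_in) frozen and trains two small matrices B (d_out x r) and A (r x d_in), so the effective weight becomes W + (alpha / r) * B * A. The dimensions and rank used here are illustrative assumptions; the model card does not state the LoRA hyperparameters used for this model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # entries of B (d_out x r) plus A (r x d_in)
    return full, lora


# Illustrative values only, not the configuration used for this model.
full, lora = lora_param_counts(d_out=2048, d_in=2048, rank=8)
print(f"full fine-tuning: {full:,} parameters, LoRA: {lora:,} parameters")
```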

📈 Evaluation

Evaluation metrics

Metric: F1 score 🏆
Limitation: may struggle with complex, dense diagrams ⚠️
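
The model card names F1 as the metric but does not describe how predicted nodes and edges are matched against the reference. The sketch below assumes the simplest scheme, exact matching between sets of node labels and sets of (source, target) pairs, and is only meant to show how such a score could be computed. The sample data is invented for illustration.

```python
def f1_score(predicted: set, reference: set) -> float:
    """F1 between a predicted and a reference set under exact matching."""
    if not predicted and not reference:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Invented example: node labels and (source, target) edge pairs.
reference_nodes = {"Start", "Review", "Approve", "End"}
predicted_nodes = {"Start", "Review", "End", "Reject"}
reference_edges = {("Start", "Review"), ("Review", "Approve"), ("Approve", "End")}
predicted_edges = {("Start", "Review"), ("Review", "End")}

node_f1 = f1_score(predicted_nodes, reference_nodes)  # 3 of 4 match on both sides
edge_f1 = f1_score(predicted_edges, reference_edges)  # 1 match: P = 1/2, R = 1/3
```

Exact matching is strict. A real evaluation might normalize labels or allow fuzzy matches, which would change the scores.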

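Evaluation results

Node detection improved by about 14 points in average F1 (0.749 to 0.891)
Edge detection improved by about 23 points in average F1 (0.4605 to 0.6945)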

Sample	Node F1 (base)	Node F1 (fine-tuned)	Edge F1 (base)	Edge F1 (fine-tuned)
image_sample_1	0.46	1.0	0.59	0.71
image_sample_2	0.67	0.57	0.25	0.25
image_sample_3	1.0	1.0	0.25	0.75
image_sample_4	0.5	0.83	0.15	0.62
image_sample_5	0.72	0.78	0.0	0.48
image_sample_6	0.75	0.75	0.29	0.67
image_sample_7	0.6	1.0	1.0	1.0
image_sample_8	0.6	1.0	1.0	1.0
image_sample_9	1.0	1.0	0.55	0.77
image_sample_10	0.67	0.8	0.0	1.0
image_sample_11	0.8	0.8	0.5	1.0
image_sample_12	0.67	1.0	0.62	0.75
image_sample_13	1.0	1.0	0.73	0.67
image_sample_14	0.74	0.95	0.56	0.67
image_sample_15	0.86	0.71	0.67	0.67
image_sample_16	0.75	1.0	0.8	0.75
image_sample_17	0.8	1.0	0.63	0.73
image_sample_18	0.83	0.83	0.33	0.43
image_sample_19	0.75	0.8	0.06	0.22
image_sample_20	0.81	1.0	0.23	0.75
Average	0.749	0.891	0.4605	0.6945
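
The averages in the last row can be reproduced from the per-sample values. The short script below recomputes them and also counts how many samples improved, regressed, or stayed the same. It suggests that the reported 14% and 23% gains are absolute differences in mean F1 (percentage points) rather than relative changes. `count_changes` is a helper introduced here for illustration.

```python
from statistics import mean

# (node F1 base, node F1 fine-tuned, edge F1 base, edge F1 fine-tuned),
# copied from the table above for image_sample_1 through image_sample_20.
rows = [
    (0.46, 1.0, 0.59, 0.71), (0.67, 0.57, 0.25, 0.25), (1.0, 1.0, 0.25, 0.75),
    (0.5, 0.83, 0.15, 0.62), (0.72, 0.78, 0.0, 0.48), (0.75, 0.75, 0.29, 0.67),
    (0.6, 1.0, 1.0, 1.0), (0.6, 1.0, 1.0, 1.0), (1.0, 1.0, 0.55, 0.77),
    (0.67, 0.8, 0.0, 1.0), (0.8, 0.8, 0.5, 1.0), (0.67, 1.0, 0.62, 0.75),
    (1.0, 1.0, 0.73, 0.67), (0.74, 0.95, 0.56, 0.67), (0.86, 0.71, 0.67, 0.67),
    (0.75, 1.0, 0.8, 0.75), (0.8, 1.0, 0.63, 0.73), (0.83, 0.83, 0.33, 0.43),
    (0.75, 0.8, 0.06, 0.22), (0.81, 1.0, 0.23, 0.75),
]

base_node, tuned_node, base_edge, tuned_edge = (mean(col) for col in zip(*rows))


def count_changes(pairs: list) -> tuple:
    """Return (improved, regressed, unchanged) counts for (base, tuned) pairs."""
    improved = sum(1 for base, tuned in pairs if tuned > base)
    regressed = sum(1 for base, tuned in pairs if tuned < base)
    return improved, regressed, len(pairs) - improved - regressed


node_changes = count_changes([(r[0], r[1]) for r in rows])
edge_changes = count_changes([(r[2], r[3]) for r in rows])

print(f"node F1: {base_node:.3f} -> {tuned_node:.3f} (+{tuned_node - base_node:.3f})")
print(f"edge F1: {base_edge:.4f} -> {tuned_edge:.4f} (+{tuned_edge - base_edge:.4f})")
print("node improved/regressed/unchanged:", node_changes)
print("edge improved/regressed/unchanged:", edge_changes)
```

The per-sample counts show that fine-tuning does not help uniformly. A few samples score lower after fine-tuning, and with only 20 samples the averages should be read as indicative rather than conclusive.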