diagram2graph-adapters: An open-source vision-language model for extracting structured data from images and converting it into knowledge graphs

Diagram2graph Adapters

Developed by zackriya

A vision-language model specialized in extracting structured data (JSON) from images, particularly adept at identifying nodes, edges, and their sub-properties in diagrams, representing visual information as knowledge graphs.

Image-to-Text

Safetensors

Open Source License: Apache-2.0 #Diagram to JSON #Knowledge Graph Construction #Visual Structured Extraction

Downloads 52

Release Time: 3/14/2025

Model Overview

This model is fine-tuned from Qwen2.5-VL-3B-Instruct. It focuses on extracting structured data from images of process diagrams and flowcharts and returns the result as JSON.

Model Features

Structured Data Extraction

Extracts nodes, edges, and their attributes from diagram images and returns them as structured JSON.

LoRA Fine-tuning Optimization

Fine-tuned with LoRA adapters on top of the base model to improve extraction performance.

Knowledge Graph Representation

Converts visual information into knowledge graph format for subsequent analysis and processing.
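
To make the knowledge-graph idea concrete, here is a minimal sketch that turns an extracted result into subject-relation-object triples. The JSON layout used here (`nodes` with `id` and `label`, `edges` with `source`, `target`, and `label`) is an assumed, illustrative schema, and `to_triples` is a hypothetical helper; the model card does not document the exact field names the model emits.

```python
import json

# Hypothetical model output; the field names are assumptions for illustration only.
raw = """
{
  "nodes": [
    {"id": "1", "label": "Start"},
    {"id": "2", "label": "Review request"},
    {"id": "3", "label": "End"}
  ],
  "edges": [
    {"source": "1", "target": "2", "label": "submit"},
    {"source": "2", "target": "3", "label": "approve"}
  ]
}
"""


def to_triples(graph):
    """Convert a node/edge dict into (subject, relation, object) triples."""
    labels = {node["id"]: node["label"] for node in graph["nodes"]}
    return [
        (labels[edge["source"]], edge.get("label", "connects_to"), labels[edge["target"]])
        for edge in graph["edges"]
    ]


triples = to_triples(json.loads(raw))
for subject, relation, obj in triples:
    print(f"{subject} --{relation}--> {obj}")
```

Triples in this form can be loaded into most graph stores or graph libraries; adapt the key names to whatever structure the model actually returns for your diagrams.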

Model Capabilities

Diagram Image Analysis

Structured Data Extraction

JSON Format Output

Knowledge Graph Construction

Use Cases

Diagram Analysis

Flowchart Parsing

Extracts structured information of nodes and edges from flowcharts

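Compared with the base model on 20 sample images, mean node-detection F1 rose by about 14 percentage points (0.749 to 0.891) and mean edge-detection F1 by about 23 percentage points (0.4605 to 0.6945)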

BPMN Analysis

Supports automated processing and analysis of BPMN diagrams

Document Processing

Automated Document Processing

Extracts structured data from diagrams in documents

🚀 Diagram-to-Graph Model

This model is a research-driven project developed during an internship at Zackriya Solutions. It specializes in extracting structured data (JSON) from images, specifically nodes, edges, and their sub-attributes, so that visual information can be represented as a knowledge graph.

⚠️ Important Note

This model is for learning purposes only and not for production applications. The extracted structured data may vary according to project needs.

🚀 Quick Start

%pip install -q "transformers>=4.49.0" accelerate datasets "qwen-vl-utils[decord]==0.0.8"

import torch
from qwen_vl_utils import process_vision_info
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor


MODEL_ID="zackriya/diagram2graph-adapters"
MAX_PIXELS = 1280 * 28 * 28
MIN_PIXELS = 256 * 28 * 28


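# Note: MODEL_ID points to an adapter repository; loading it this way is assumed to rely on the `peft` package being installed alongside transformers.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(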
	MODEL_ID,
	device_map="auto",
	torch_dtype=torch.bfloat16
)

processor = Qwen2_5_VLProcessor.from_pretrained(
	MODEL_ID,
	min_pixels=MIN_PIXELS,
	max_pixels=MAX_PIXELS
)


SYSTEM_MESSAGE = """You are a Vision Language Model specialized in extracting structured data from visual representations of process and flow diagrams.
Your task is to analyze the provided image of a diagram and extract the relevant information into a well-structured JSON format.
The diagram includes details such as nodes and edges. each of them have their own attributes.
Focus on identifying key data fields and ensuring the output adheres to the requested JSON structure.
Provide only the JSON output based on the extracted information. Avoid additional explanations or comments."""

def run_inference(image):
	messages= [
    	{
        	"role": "system",
        	"content": [{"type": "text", "text": SYSTEM_MESSAGE}],
    	},
    	{
        	"role": "user",
        	"content": [
            	{
                	"type": "image",
                	# process_vision_info from qwen_vl_utils handles this entry, so it may be a PIL image, a URL, or a local file path
                	"image": image,
            	},
            	{
                	"type": "text",
                	"text": "Extract data in JSON format, Only give the JSON",
            	},
        	],
    	},
	]

	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	image_inputs, _ = process_vision_info(messages)

	inputs = processor(
    	text=[text],
    	images=image_inputs,
    	return_tensors="pt",
	)
	inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto" instead of hard-coding 'cuda'

	generated_ids = model.generate(**inputs, max_new_tokens=512)
	generated_ids_trimmed = [
    	out_ids[len(in_ids):]
    	for in_ids, out_ids
    	in zip(inputs.input_ids, generated_ids)
	]

	output_text = processor.batch_decode(
    	generated_ids_trimmed,
    	skip_special_tokens=True,
    	clean_up_tokenization_spaces=False
	)
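	return output_text

# `eval_dataset` is assumed to be a dataset with an "image" column loaded beforehand; it is not defined in this snippet.
image = eval_dataset[9]['image']  # PIL image
# `image` may also be a URL or a local path to the image
output = run_inference(image)

# Parse the first (and only) decoded output as JSON
import json
json.loads(output[0])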
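
Calling `json.loads(output[0])` directly raises an exception if the generated text is not valid JSON, for example when generation stops at `max_new_tokens=512` before the closing brace, or when the text is wrapped in a Markdown code fence. The following is a minimal sketch of a more forgiving parser. `parse_model_json` is a hypothetical helper, and whether this model ever emits fenced output is an assumption, not something the model card states.

```python
import json
import re

# Markdown code-fence marker (three backticks), built programmatically for readability.
FENCE = "`" * 3


def parse_model_json(text):
    """Parse JSON from model output, tolerating an optional Markdown code fence.

    Returns the parsed object, or None if the text is not valid JSON.
    """
    cleaned = text.strip()
    # Drop a leading fence (optionally tagged "json") and a trailing fence.
    cleaned = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*" + FENCE + r"$", "", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Typically truncated or otherwise malformed output.
        return None


print(parse_model_json('{"nodes": [], "edges": []}'))
print(parse_model_json('{"nodes": ['))  # truncated output
```

In the Quick Start flow this would be used as `parse_model_json(output[0])`; a `None` result is a signal to retry, possibly with a larger `max_new_tokens`.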

✨ Features

✅ Direct Use

Experiment with diagram-to-graph conversion 📊
Understand AI-driven structured extraction from images

🚀 Downstream Use (Potential)

Enhance BPMN/Flowchart analysis 🏗️
Support automated document processing 📄

❌ Out-of-Scope Use

Not designed for real-world production deployment ⚠️
May not generalize well across all diagram types
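
Because extraction quality can vary across diagram types, it may help to sanity-check each result before using it downstream. The sketch below flags duplicate node IDs and edges that point to nodes missing from the output. The `id`, `source`, and `target` keys are an assumed schema, and `find_graph_problems` is a hypothetical helper, not part of the model or its tooling.

```python
def find_graph_problems(graph):
    """Return a list of human-readable problems found in an extracted graph."""
    problems = []
    seen = set()
    for node in graph.get("nodes", []):
        node_id = node.get("id")
        if node_id in seen:
            problems.append(f"duplicate node id: {node_id}")
        seen.add(node_id)
    for edge in graph.get("edges", []):
        for end in ("source", "target"):
            if edge.get(end) not in seen:
                problems.append(f"edge {end} refers to unknown node: {edge.get(end)}")
    return problems


# Hypothetical extraction result with one dangling edge.
example = {
    "nodes": [{"id": "a"}, {"id": "b"}],
    "edges": [{"source": "a", "target": "b"}, {"source": "b", "target": "c"}],
}
problems = find_graph_problems(example)
print(problems)
```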

📚 Documentation

📝 Model Details

Property	Details
Developed by	Zackriya Solutions Internship Team (Mohammed Safvan)
Fine Tuned from	`Qwen/Qwen2.5-VL-3B-Instruct`
License	Apache 2.0
Language(s)	Multilingual (focus on structured extraction)
Model type	Vision-Language Transformer (PEFT fine-tuned)

🏗️ Training Details

Dataset: Internally curated diagram dataset 🖼️
Fine-tuning: LoRA-based optimization ⚡
Precision: bf16 mixed-precision training 🎯
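
As background on LoRA-based fine-tuning: LoRA freezes a weight matrix of shape `d_out x d_in` and trains two small factors, `B` (`d_out x r`) and `A` (`r x d_in`), whose product is added to the frozen weight. The sketch below compares parameter counts for one such matrix. The dimensions and rank are illustrative values only; the model card does not state the rank or target modules used for this model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the full weight matrix (d_out x d_in)."""
    return d_in * d_out


# Illustrative values, not the actual settings used to train this model.
d_in, d_out, rank = 2048, 2048, 8
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"full: {full:,}  lora: {lora:,}  fraction trained: {lora / full:.2%}")
```

With these example numbers, the adapter trains under 1% of the parameters of the matrix it modifies, which is why adapters can be distributed separately from the base model.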

📈 Evaluation

Metrics: F1-score 🏆
Limitations: May struggle with complex, dense diagrams ⚠️
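
For reference, a set-based F1 score over nodes or edges can be computed as in the sketch below. It assumes that a predicted item counts as correct only when it exactly matches a ground-truth item. The model card does not describe the matching rule used in its evaluation, so treat this as one plausible formulation rather than the authors' procedure.

```python
def f1_score(predicted, expected):
    """Set-based F1: an item is correct only if it exactly matches a ground-truth item."""
    predicted, expected = set(predicted), set(expected)
    if not predicted and not expected:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and prediction for one diagram.
expected_nodes = {"Start", "Review request", "Approve", "End"}
predicted_nodes = {"Start", "Review request", "End", "Reject"}
node_f1 = f1_score(predicted_nodes, expected_nodes)

# Edges compared as (source, target) pairs.
expected_edges = {("Start", "Review request"), ("Review request", "End")}
predicted_edges = {("Start", "Review request")}
edge_f1 = f1_score(predicted_edges, expected_edges)

print(round(node_f1, 2), round(edge_f1, 2))
```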

Results

About +14 percentage points in mean node-detection F1 (0.749 to 0.891)
About +23 percentage points in mean edge-detection F1 (0.4605 to 0.6945)

Sample	Node F1 (Base)	Node F1 (Fine-tuned)	Edge F1 (Base)	Edge F1 (Fine-tuned)
image_sample_1	0.46	1.0	0.59	0.71
image_sample_2	0.67	0.57	0.25	0.25
image_sample_3	1.0	1.0	0.25	0.75
image_sample_4	0.5	0.83	0.15	0.62
image_sample_5	0.72	0.78	0.0	0.48
image_sample_6	0.75	0.75	0.29	0.67
image_sample_7	0.6	1.0	1.0	1.0
image_sample_8	0.6	1.0	1.0	1.0
image_sample_9	1.0	1.0	0.55	0.77
image_sample_10	0.67	0.8	0.0	1.0
image_sample_11	0.8	0.8	0.5	1.0
image_sample_12	0.67	1.0	0.62	0.75
image_sample_13	1.0	1.0	0.73	0.67
image_sample_14	0.74	0.95	0.56	0.67
image_sample_15	0.86	0.71	0.67	0.67
image_sample_16	0.75	1.0	0.8	0.75
image_sample_17	0.8	1.0	0.63	0.73
image_sample_18	0.83	0.83	0.33	0.43
image_sample_19	0.75	0.8	0.06	0.22
image_sample_20	0.81	1.0	0.23	0.75
Mean	0.749	0.891	0.4605	0.6945
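
The summary figures can be reproduced from the per-sample scores in the table. The sketch below recomputes the column means and reports the gain both as an absolute difference in mean F1 (percentage points) and as a relative change, since the two readings differ noticeably. The `mean` and `summarize` helpers are illustrative names.

```python
# Per-sample scores copied from the table above (samples 1 to 20).
base_node = [0.46, 0.67, 1.0, 0.5, 0.72, 0.75, 0.6, 0.6, 1.0, 0.67,
             0.8, 0.67, 1.0, 0.74, 0.86, 0.75, 0.8, 0.83, 0.75, 0.81]
fine_node = [1.0, 0.57, 1.0, 0.83, 0.78, 0.75, 1.0, 1.0, 1.0, 0.8,
             0.8, 1.0, 1.0, 0.95, 0.71, 1.0, 1.0, 0.83, 0.8, 1.0]
base_edge = [0.59, 0.25, 0.25, 0.15, 0.0, 0.29, 1.0, 1.0, 0.55, 0.0,
             0.5, 0.62, 0.73, 0.56, 0.67, 0.8, 0.63, 0.33, 0.06, 0.23]
fine_edge = [0.71, 0.25, 0.75, 0.62, 0.48, 0.67, 1.0, 1.0, 0.77, 1.0,
             1.0, 0.75, 0.67, 0.67, 0.67, 0.75, 0.73, 0.43, 0.22, 0.75]


def mean(values):
    return sum(values) / len(values)


def summarize(name, base, fine):
    """Print and return the absolute and relative change in mean F1."""
    base_mean, fine_mean = mean(base), mean(fine)
    absolute = fine_mean - base_mean
    relative = absolute / base_mean
    print(f"{name}: {base_mean:.4f} -> {fine_mean:.4f} "
          f"(+{absolute * 100:.1f} points, +{relative * 100:.1f}% relative)")
    return absolute, relative


node_abs, node_rel = summarize("node F1", base_node, fine_node)
edge_abs, edge_rel = summarize("edge F1", base_edge, fine_edge)
```

The 14 and 23 figures correspond to the absolute differences in mean F1. Expressed relative to the base model, the gains are roughly 19% for nodes and 51% for edges.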

🤝 Collaboration

Are you interested in fine-tuning your own model for your use case, or would you like to explore how we can help? Let's collaborate.

Zackriya Solutions

🚀Stay Curious & Keep Exploring!🚀

📄 License

This project is licensed under the Apache 2.0 license.
