OpenVLThinker-7B: Open-Source Vision-Language Reasoning Model for Visual Math Problems

OpenVLThinker-7B

Developed by ydeng9

OpenVLThinker-7B is a vision-language reasoning model specifically designed for multimodal tasks, with particular optimization for solving visual mathematical problems.

Image-to-Text

Transformers

Open Source License: Apache-2.0

#Visual Mathematical Reasoning #Multimodal Understanding #High-Precision Vision-Language Model

Downloads 594

Release Time: 3/20/2025

Model Overview

A vision-language reasoning model based on Qwen2.5-VL-7B-Instruct, focused on solving complex visual mathematical problems with multimodal understanding and reasoning capabilities.

Model Features

Multimodal Reasoning

Capable of processing both visual and textual information for cross-modal reasoning.

Visual Mathematical Problem Solving

Specially optimized for solving mathematical problems requiring visual understanding.

Efficient Inference

Supports flash_attention_2 for efficient inference.
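
FlashAttention-2 is only usable when a CUDA GPU is present and the separate flash-attn package is installed, so a loader may want to fall back to another attention backend. The sketch below shows one way to choose the value passed as attn_implementation. The helper name pick_attn_implementation is hypothetical (it is not part of transformers), and falling back to "sdpa" is an assumption about what suits your environment.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return a value for from_pretrained(attn_implementation=...).

    Hypothetical helper for illustration; not part of transformers.
    """
    # flash_attention_2 needs both a CUDA device and the flash_attn package.
    flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and flash_attn_installed:
        return "flash_attention_2"
    # "sdpa" selects PyTorch's built-in scaled dot-product attention.
    return "sdpa"
```

In practice the argument would typically be torch.cuda.is_available(), and the returned string would be passed to from_pretrained.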

Model Capabilities

Image Understanding

Text Generation

Visual Mathematical Problem Solving

Multimodal Reasoning

Use Cases

Education

Visual Math Problem Solving

Helps students solve math problems containing diagrams and images.

The model reads the diagram together with the question text and generates a solution.

Research

Multimodal Reasoning Research

Used for research related to vision-language reasoning.

🚀 OpenVLThinker-7B

OpenVLThinker-7B is a vision-language reasoning model tailored for multimodal tasks, with a particular focus on visual mathematical problem-solving.

Property	Details
Base Model	Qwen/Qwen2.5-VL-7B-Instruct
License	apache-2.0
Library Name	transformers
Pipeline Tag	image-text-to-text

For more details: Blog, GitHub

🚀 Quick Start

This section demonstrates how to use the OpenVLThinker-7B model for multimodal tasks.

💻 Usage Examples

Basic Usage

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
import torch
from qwen_vl_utils import process_vision_info

# 1. Define model and processor names
model_name = "ydeng9/OpenVLThinker-7B"
processor_name = "Qwen/Qwen2.5-VL-7B-Instruct"

# 2. Load the OpenVLThinker-7B model and processor
# flash_attention_2 needs a CUDA GPU and the flash-attn package; use SDPA on CPU
use_cuda = torch.cuda.is_available()
device = "cuda:0" if use_cuda else "cpu"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2" if use_cuda else "sdpa",
    device_map=device
)
processor = AutoProcessor.from_pretrained(processor_name)

# 3. Define a sample image URL and an instruction
image_url = "https://example.com/sample_image.jpg"  # replace with your image URL
instruction = "Example question"

# 4. Create a multimodal prompt using a chat message structure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": instruction},
        ],
    }
]

# 5. Generate a text prompt from the chat messages
text_prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# 6. Process image (and video) inputs from the messages
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text_prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(device)

# 7. Generate the model's response
# top_k=1 keeps only the most likely token, so sampling is effectively greedy
generated_ids = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=2048,
    top_p=0.001,
    top_k=1,
    temperature=0.01,
    repetition_penalty=1.0,
)

# 8. Decode only the newly generated tokens (drop the echoed prompt) into text
trimmed_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(
    trimmed_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# 9. Print the generated response
print("Generated Response:")
print(generated_text)
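
The example above points at a remote image URL. For an image stored on disk, the same chat message structure could be built as in the sketch below. The helper name build_messages and the sample file name are hypothetical, and the sketch assumes that process_vision_info accepts a file:// prefix followed by an absolute path, as in the Qwen2.5-VL usage examples.

```python
from pathlib import Path


def build_messages(image_path: str, question: str) -> list:
    """Build a single-turn user message with one local image and one question.

    Hypothetical helper for illustration; not part of qwen_vl_utils.
    """
    # Assumption: the vision utilities accept "file://" plus an absolute path.
    image_ref = "file://" + str(Path(image_path).resolve())
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": question},
            ],
        }
    ]


# The file name here is a placeholder; replace it with your own image.
messages = build_messages("geometry_problem.png", "Example question")
```

The resulting list can be used in place of the messages defined in step 4.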

Citation

@misc{deng2025openvlthinker,
      title={OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement}, 
      author={Yihe Deng and Hritik Bansal and Fan Yin and Nanyun Peng and Wei Wang and Kai-Wei Chang},
      year={2025},
      eprint={2503.17352},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.17352}, 
}
