🚀 VisualThinker-R1-Zero
VisualThinker-R1-Zero is a project that successfully replicates the emergent characteristics of multimodal reasoning on a non-SFT 2B model, reaching 59.47% accuracy on CVBench and sharing insights into the challenges of applying RL to multimodal reasoning.
Paper: [R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model](https://arxiv.org/abs/2503.05132)
🚀 Overview
The recent DeepSeek-R1 demonstrated how reinforcement learning with a simple rule-based reward can enable the autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning have often failed to reproduce these key characteristics.
In this project, starting from Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by ~30% and exceeding both SFT settings by ~2%. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero.
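To make "simple rule-based reward" concrete, the sketch below scores a completion with two hand-written rules: a format check (is the reasoning wrapped in `<think>` tags and followed by an answer?) and an exact-match accuracy check against the ground truth. The function names and the tag/answer conventions are illustrative assumptions, not this project's actual training code.

```python
import re


def format_reward(completion: str) -> float:
    """Reward 1.0 if the completion has a <think>...</think> block followed by an answer (illustrative rule)."""
    pattern = r"<think>.*?</think>.*\S+"  # reasoning block, then at least some answer text
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 if the text after the reasoning block contains the ground-truth choice (illustrative rule)."""
    answer = completion.split("</think>")[-1]
    return 1.0 if ground_truth.strip().lower() in answer.strip().lower() else 0.0


def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Combine the two rule-based signals; no learned reward model is involved."""
    return format_reward(completion) + accuracy_reward(completion, ground_truth)


# Example: a well-formatted, correct completion receives the maximum reward of 2.0.
print(rule_based_reward("<think>The sofa is under the picture.</think> The answer is (B) below", "(B) below"))
```

Rewards of this form are what the RL loop optimizes: the policy only ever sees scalar scores derived from these checks, which is what makes the emergent reasoning behavior notable.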
✨ Features
- We are the first to successfully produce the emergent "aha moment" and increased response length for multimodal reasoning on a non-SFT 2B model.
- We show that vision-centric tasks also benefit from improved reasoning capabilities.
Similar to DeepSeek-R1, self-reflection behavior is also observed during our RL training on vision-centric reasoning tasks. The model exhibits an emergent ability to rethink and correct its mistakes:
> . . .
> Therefore, dark brown wooden bed with white blanket is not above the doorway.
> But wait! I can think of something else.
> Maybe it's just higher than above the doorway, but slightly lower than above the doorway.
> . . .
📦 Installation
- Python >= 3.10
- PyTorch == 2.0.1
- CUDA Version >= 11.7
- Install required packages:
```bash
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
```
💻 Usage Examples
Basic Usage
```python
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the processor and model from the Hugging Face Hub
processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
model = AutoModelForImageTextToText.from_pretrained(
    "turningpoint-ai/VisualThinker-R1-Zero", torch_dtype="auto", device_map="auto"
)
model.eval()

# Demo image and spatial-reasoning question
image_url = "https://multimodal-r1.s3.us-west-1.amazonaws.com/demo_image.jpg"
question = "Considering the relative positions of the sofa and the picture in the image provided, where is the sofa located with respect to the picture? Select from the following choices.\n(A) above or \n(B) below"
prompt = f"A conversation between User and Assistant. The user asks a question about the image, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: {question} \nAssistant: Let me solve this step by step.\n<think>"

# Build the chat message: a single user turn containing the image and the prompt
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_url,
            },
            {"type": "text", "text": "<image>" + prompt},
        ],
    }
]

# Download the image and prepare the model inputs
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))
text = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=text,
    images=image,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate a response and strip the prompt tokens from the output
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=1024, do_sample=True)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
output_text = batch_output_text[0]
print(output_text)
```
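Because the prompt already opens a `<think>` block, the decoded output typically contains the model's reasoning followed by its final choice. One heuristic way to pull out the answer letter, assuming the reasoning ends with `</think>` and the answer repeats one of the lettered options, is the hypothetical helper below (not part of the project's code):

```python
import re


def extract_choice(output_text: str) -> str | None:
    """Return the last lettered choice, e.g. '(B)', mentioned after the reasoning block (heuristic)."""
    answer_part = output_text.split("</think>")[-1]
    matches = re.findall(r"\(([A-D])\)", answer_part)
    return f"({matches[-1]})" if matches else None


print(extract_choice(output_text))
```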
🙌 Stay Connected!
We are always open to engaging discussions, collaborations, or even just sharing a virtual coffee. To get in touch or join our team, visit TurningPoint AI's homepage for contact information.
📖 Acknowledgements
We sincerely thank DeepSeek, Open-R1, QwenVL, [Open-R1-Multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal), [R1-V](https://github.com/Deep-Agent/R1-V), SAT, and [CV-Bench](https://cambrian-mllm.github.io/) for providing open-source resources that laid the foundation of our project.
🤝 Contributors
Here are the key contributors from TurningPoint AI to this project:
Hengguang Zhou<sup>1</sup>*, [Xirui Li](https://xirui-li.github.io/)<sup>1</sup>*, Ruochen Wang<sup>1</sup>†, Minhao Cheng<sup>2</sup>, Tianyi Zhou<sup>3</sup> and Cho-Jui Hsieh<sup>1,4</sup>
* Project Leads, † Main Advisor
<sup>1</sup>University of California, Los Angeles, <sup>2</sup>Penn State University, <sup>3</sup>University of Maryland and <sup>4</sup>Google Research
✏️ Citation
```bibtex
@misc{zhou2025r1zerosahamomentvisual,
  title={R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model},
  author={Hengguang Zhou and Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh},
  year={2025},
  eprint={2503.05132},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2503.05132},
}
```
📄 License
This project is released under the MIT License.