🚀 VisualThinker-R1-Zero
VisualThinker-R1-Zero is a project focused on multimodal reasoning. Built on a non-SFT 2B model, it is the first to reproduce the "aha moment" and increased response length in multimodal reasoning. The model achieves 59.47% accuracy on CVBench, outperforming the base model by roughly 30% and exceeding the SFT setting by about 2%.
🚀 Quick Start
This project applies reinforcement learning directly to the Qwen2-VL-2B model on the SAT dataset to improve its multimodal reasoning ability. The project code is available on GitHub.
✨ Key Features
- First breakthrough: the first to successfully reproduce the "aha moment" and increased response length for multimodal reasoning on a non-SFT 2B model.
- Vision-centric tasks benefit: vision-centric tasks also gain from the improved reasoning ability. During RL training on vision-based reasoning tasks, the model exhibits self-reflective behavior, rethinking and correcting its own mistakes. For example:
. . .
Therefore, dark brown wooden bed with white blanket is not above the doorway.
But wait! I can think of something else.
Maybe it's just higher than above the doorway, but slightly lower than above the doorway.
. . .
📦 Installation
Requirements
- Python >= 3.10
- PyTorch == 2.0.1
- CUDA Version >= 11.7
Installation Steps
Install the required packages:
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
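After installing, you can optionally run a quick sanity check to confirm your environment matches the requirements above. This snippet is only an illustrative check, not part of the project code:

# Illustrative environment sanity check (not part of the released code)
import sys

import torch
import transformers

print("Python:", sys.version.split()[0])              # expected >= 3.10
print("PyTorch:", torch.__version__)                   # expected 2.0.1
print("CUDA available:", torch.cuda.is_available())    # expected True with CUDA >= 11.7
print("Transformers:", transformers.__version__)       # installed from the GitHub main branch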
💻 Usage Example
Basic Usage
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the processor and the released model checkpoint
processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
model = AutoModelForImageTextToText.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero", torch_dtype="auto", device_map="auto")
model.eval()

# Demo image and spatial-reasoning question
image_url = "https://multimodal-r1.s3.us-west-1.amazonaws.com/demo_image.jpg"
question = "Considering the relative positions of the sofa and the picture in the image provided, where is the sofa located with respect to the picture? Select from the following choices.\n(A) above or \n(B) below"

# R1-Zero-style prompt: the model reasons inside <think> before giving its answer
prompt = f"A conversation between User and Assistant. The user asks a question about the image, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: {question} \nAssistant: Let me solve this step by step.\n<think>"

# Build the conversation in the role/content format expected by the Qwen2-VL chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": prompt},
        ],
    }
]

# Download the image and prepare the model inputs
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=1024, do_sample=True)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
output_text = batch_output_text[0]
print(output_text)
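Because the prompt ends with an opening <think> tag, the generated text typically contains the reasoning trace followed by the final choice. Below is a minimal sketch of how you might pull out the answer letter; the helper name and regular expression are illustrative assumptions, not part of the released code:

import re

def extract_choice(output_text: str) -> str | None:
    """Illustrative helper: return the last answer letter, e.g. 'A' or 'B', mentioned in the output."""
    matches = re.findall(r"\(([A-D])\)", output_text)
    return matches[-1] if matches else None

print("Predicted choice:", extract_choice(output_text))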
🙌 Stay Connected
We always welcome meaningful discussions, collaborations, or even just sharing a virtual coffee together. To get in touch or join our team, please visit the TurningPoint AI homepage for contact information.
📖 Acknowledgements
We sincerely thank DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V, SAT, and CV-Bench for the open-source resources that laid the foundation for our project.
🤝 Contributors
Here are the key contributors to this project from TurningPoint AI:
Hengguang Zhou¹*, Xirui Li¹*, Ruochen Wang¹†, Minhao Cheng², Tianyi Zhou³, and Cho-Jui Hsieh¹ ⁴
* Project Leads, † Main Advisor
¹ University of California, Los Angeles, ² Pennsylvania State University, ³ University of Maryland, ⁴ Google Research
✏️ Citation
@misc{zhou2025r1zerosahamomentvisual,
title={R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model},
author={Hengguang Zhou and Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh},
year={2025},
eprint={2503.05132},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2503.05132},
}
📄 License
This project is released under the MIT License.