🚀 VisualThinker-R1-Zero
VisualThinker-R1-Zero is a project focused on multimodal reasoning. Built on a non-SFT 2B model, it is the first to reproduce the "aha moment" and increased response length in multimodal reasoning. The model achieves 59.47% accuracy on CVBench, outperforming the base model by roughly 30% and exceeding the SFT setting by about 2%.
🚀 Quick Start
This project applies reinforcement learning directly to the Qwen2-VL-2B model on the SAT dataset to improve its multimodal reasoning ability. The project code is available on GitHub.
✨ Key Features
- First breakthrough: the first to successfully reproduce the "aha moment" and increased response length for multimodal reasoning on a non-SFT 2B model.
- Vision-centric tasks benefit: vision-centric tasks also gain from the improved reasoning ability. During RL training on vision-based reasoning tasks, the model exhibits self-reflective behavior, rethinking and correcting its own mistakes. For example:
. . .
Therefore, dark brown wooden bed with white blanket is not above the doorway.
But wait! I can think of something else.
Maybe it's just higher than above the doorway, but slightly lower than above the doorway.
. . .
📦 Installation
Requirements
- Python >= 3.10
- PyTorch == 2.0.1
- CUDA Version >= 11.7
Installation Steps
Install the required packages:
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
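After installing, you can optionally run a quick sanity check to confirm your environment matches the requirements above. This snippet is only an illustrative check, not part of the project code:

# Illustrative environment sanity check (not part of the released code)
import sys

import torch
import transformers

print("Python:", sys.version.split()[0])              # expected >= 3.10
print("PyTorch:", torch.__version__)                   # expected 2.0.1
print("CUDA available:", torch.cuda.is_available())    # expected True with CUDA >= 11.7
print("Transformers:", transformers.__version__)       # installed from the GitHub main branch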
💻 Usage Example
Basic Usage
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the processor and the released model checkpoint
processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
model = AutoModelForImageTextToText.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero", torch_dtype="auto", device_map="auto")
model.eval()

# Demo image and spatial-reasoning question
image_url = "https://multimodal-r1.s3.us-west-1.amazonaws.com/demo_image.jpg"
question = "Considering the relative positions of the sofa and the picture in the image provided, where is the sofa located with respect to the picture? Select from the following choices.\n(A) above or \n(B) below"

# R1-Zero-style prompt: the model reasons inside <think> before giving its answer
prompt = f"A conversation between User and Assistant. The user asks a question about the image, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: {question} \nAssistant: Let me solve this step by step.\n<think>"

# Build the conversation in the role/content format expected by the Qwen2-VL chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": prompt},
        ],
    }
]

# Download the image and prepare the model inputs
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=1024, do_sample=True)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
output_text = batch_output_text[0]
print(output_text)
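Because the prompt ends with an opening <think> tag, the generated text typically contains the reasoning trace followed by the final choice. Below is a minimal sketch of how you might pull out the answer letter; the helper name and regular expression are illustrative assumptions, not part of the released code:

import re

def extract_choice(output_text: str) -> str | None:
    """Illustrative helper: return the last answer letter, e.g. 'A' or 'B', mentioned in the output."""
    matches = re.findall(r"\(([A-D])\)", output_text)
    return matches[-1] if matches else None

print("Predicted choice:", extract_choice(output_text))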
🙌 Stay Connected
We always welcome meaningful discussions, collaborations, or even just sharing a virtual coffee together. To get in touch or join our team, please visit the TurningPoint AI homepage for contact information.
📖 Acknowledgements
We sincerely thank DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V, SAT, and CV-Bench for the open-source resources that laid the foundation for our project.
🤝 Contributors
Here are the key contributors to this project from TurningPoint AI:
Hengguang Zhou¹*, Xirui Li¹*, Ruochen Wang¹†, Minhao Cheng², Tianyi Zhou³, and Cho-Jui Hsieh¹ ⁴
* Project Leads, † Main Advisor
¹ University of California, Los Angeles, ² Pennsylvania State University, ³ University of Maryland, ⁴ Google Research
✏️ Citation
@misc{zhou2025r1zerosahamomentvisual,
title={R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model},
author={Hengguang Zhou and Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh},
year={2025},
eprint={2503.05132},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2503.05132},
}
📄 License
This project is released under the MIT License.