🚀 VisualThinker-R1-Zero
VisualThinker-R1-Zero is a project focused on multimodal reasoning. Built on a non-SFT 2B model, it is the first to reproduce the "aha moment" and increased response length in multimodal reasoning. The model reaches 59.47% accuracy on CVBench, outperforming the base model by roughly 30% and surpassing the SFT setting by about 2%.
🚀 Quick Start
This project applies reinforcement learning directly to the Qwen2-VL-2B model on the SAT dataset, improving its multimodal reasoning ability; a sketch of this style of training is given below. The project code is available on GitHub.
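For intuition, here is a minimal sketch of the kind of rule-based reward used in R1-Zero-style RL training: a format reward for closing the reasoning block plus an accuracy reward for the multiple-choice answer. The function names and answer-extraction heuristic are illustrative assumptions, not the project's exact implementation.

import re

# Illustrative R1-Zero-style rule-based reward (an assumption, not the project's exact code).
# The prompt opens a <think> block, so a well-formed completion closes it and then
# states a multiple-choice answer such as "(A)".

def format_reward(completion: str) -> float:
    # Reward completions that close the reasoning block.
    return 1.0 if "</think>" in completion else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # Extract the choice letter after the reasoning block and compare it
    # with the ground-truth letter from the dataset annotation.
    answer_part = completion.split("</think>")[-1]
    match = re.search(r"\(([A-D])\)", answer_part)
    return 1.0 if match and match.group(1) == ground_truth else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    return format_reward(completion) + accuracy_reward(completion, ground_truth)

# Example: a completion that closes its reasoning and answers (B).
print(total_reward("the sofa is below the picture ... </think> The answer is (B).", "B"))  # 2.0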
✨ Key Features
- First breakthrough: the first to reproduce the multimodal "aha moment" and increased response length on a non-SFT 2B model.
- Vision-centric tasks benefit: vision-centric tasks also gain from improved reasoning. During RL training on vision-based reasoning tasks, the model exhibits self-reflective behavior, rethinking and correcting its own mistakes. For example:
. . .
Therefore, dark brown wooden bed with white blanket is not above the doorway.
But wait! I can think of something else.
Maybe it's just higher than above the doorway, but slightly lower than above the doorway.
. . .
📦 Installation
Requirements
- Python >= 3.10
- PyTorch == 2.0.1
- CUDA Version >= 11.7
Installation Steps
Install the required packages:
pip install git+https://github.com/huggingface/transformers
pip install qwen-vl-utils
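Optionally, you can run a minimal check that the environment is ready (this is an illustrative verification step, not part of the official setup):
python -c "import torch, transformers; print(transformers.__version__, torch.version.cuda, torch.cuda.is_available())"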
💻 Usage Example
Basic Usage
from PIL import Image
import requests
from io import BytesIO
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the processor and model; device_map="auto" places the weights on the available GPU(s)
processor = AutoProcessor.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero")
model = AutoModelForImageTextToText.from_pretrained("turningpoint-ai/VisualThinker-R1-Zero", torch_dtype="auto", device_map="auto")
model.eval()

image_url = "https://multimodal-r1.s3.us-west-1.amazonaws.com/demo_image.jpg"
question = "Considering the relative positions of the sofa and the picture in the image provided, where is the sofa located with respect to the picture? Select from the following choices.\n(A) above or \n(B) below"

# R1-Zero-style prompt: the model reasons inside an opened <think> block before answering
prompt = f"A conversation between User and Assistant. The user asks a question about the image, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\nUser: {question} \nAssistant: Let me solve this step by step.\n<think>"

# Conversation in the role/content format expected by the processor's chat template
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_url,
            },
            {"type": "text", "text": "<image>" + prompt},
        ],
    }
]

# Download the demo image
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

# Render the chat template and preprocess the image and text together
text = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=text,
    images=image,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so only the new completion is decoded
generated_ids = model.generate(**inputs, use_cache=True, max_new_tokens=1024, do_sample=True)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
batch_output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
output_text = batch_output_text[0]
print(output_text)
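Since do_sample=True, the generated reasoning varies between runs. The completion continues the opened <think> block, and the final answer, (A) or (B), appears after the reasoning.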
🙌 Stay Connected
We are always open to meaningful discussion, collaboration, or even just sharing a virtual coffee. To get in touch or join our team, visit the TurningPoint AI homepage for contact information.
📖 Acknowledgements
We sincerely thank DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V, SAT, and CV-Bench for their open-source resources, which laid the foundation for this project.
🤝 Contributors
Here are the key contributors to this project from TurningPoint AI:
Hengguang Zhou1*, Xirui Li1*, Ruochen Wang1†, Minhao Cheng2, Tianyi Zhou3, and Cho-Jui Hsieh1,4
* Project Leads, † Main Advisor
1 University of California, Los Angeles, 2 Pennsylvania State University, 3 University of Maryland, 4 Google Research
✏️ Citation
@misc{zhou2025r1zerosahamomentvisual,
title={R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model},
author={Hengguang Zhou and Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh},
year={2025},
eprint={2503.05132},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2503.05132},
}
📄 License
This project is released under the MIT License.