Tiangong - R1V2 - 38B Open - source Multimodal Inference Model - Free Deployment for Visual Inference and Text Comprehension

Skywork R1V2 38B

Developed by Skywork

Skywork-R1V2-38B is currently the most advanced open-source multimodal reasoning model, demonstrating outstanding performance in multiple benchmark tests with robust visual reasoning and text comprehension capabilities.

Image-to-Text

Transformers

Open Source License:MIT #Multimodal Reasoning #Visual Language Understanding #Academic Benchmark Leader

Downloads 1,778

Release Time : 4/25/2025

Model Overview

A high-performance open-source vision-language model combining visual reasoning and text comprehension, leading other open-source models in benchmarks such as MMMU and OlympiadBench.

Model Features

Multimodal Reasoning Capability

Achieved a score of 73.6% in the MMMU test, the highest among all open-source models.

Outstanding Visual Understanding

Reached 62.6% on OlympiadBench, significantly outperforming other open-source models.

Comparable to Commercial Models

Demonstrated strong performance in MathVision, MMMU-Pro, and MathVista tests, approaching the performance of commercial closed-source models.

Open Source Accessibility

Fully open-source, available via Hugging Face and ModelScope model repositories.

Model Capabilities

Multimodal Reasoning

Visual Question Answering

Image Understanding

Complex Problem Solving

Cross-modal Information Processing

Use Cases

Education

Math Problem Solving

Analyze and solve problems containing mathematical formulas and diagrams.

Achieved 74.0% accuracy in the MathVista test.

Science Problem Solving

Understand scientific charts and answer related questions.

Achieved 62.6% accuracy in the OlympiadBench test.

Research

Multimodal Research

Used for cutting-edge research in vision-language models.

🚀 Skywork-R1V2

Skywork-R1V2-38B is a state-of-the-art open-source multimodal reasoning model. It combines powerful visual reasoning and text understanding, achieving top-tier performance across multiple benchmarks.

📖 R1V2 Report | 💻 GitHub | 🌐 ModelScope

🚀 Quick Start

The following steps will guide you through using Skywork-R1V2:

Clone the Repository:

git clone https://github.com/SkyworkAI/Skywork-R1V.git
cd skywork-r1v/inference

Set Up the Environment:

# For Transformers  
conda create -n r1-v python=3.10 && conda activate r1-v  
bash setup.sh  
# For vLLM  
conda create -n r1v-vllm python=3.10 && conda activate r1v-vllm  
pip install -U vllm

Run the Inference Script:

transformers inference:

CUDA_VISIBLE_DEVICES="0,1" python inference_with_transformers.py \
    --model_path path \
    --image_paths image1_path \
    --question "your question"

vllm inference:

python inference_with_vllm.py \
    --model_path path \
    --image_paths image1_path image2_path \
    --question "your question" \
    --tensor_parallel_size 4

✨ Features

Skywork-R1V2-38B is a state-of-the-art open-source multimodal reasoning model, achieving top-tier performance across multiple benchmarks:

On MMMU, it scores 73.6%, the highest among all open-source models to date.
On OlympiadBench, it achieves 62.6%, leading by a large margin over other open models.
R1V2 also performs strongly on MathVision, MMMU-Pro, and MathVista, rivaling proprietary commercial models.
Overall, R1V2 stands out as a high-performing, open-source VLM combining powerful visual reasoning and text understanding.

🔧 Technical Details

Property	Details
Model Type	Skywork-R1V2-38B
Vision Encoder	InternViT-6B-448px-V2_5
Language Model	Qwen/QwQ-32B
Hugging Face Link	🤗 Link

📚 Documentation

Evaluation

Comparison with Larger-Scale Open-Source Models

Comparison with Proprietary Models

Evaluation Results of State-of-the-Art LLMs and VLMs

Model	Supports Vision	Text Reasoning (%)						Multimodal Reasoning (%)
		AIME24	LiveCodebench	liveBench	IFEVAL	BFCL	GPQA	MMMU(val)	MathVista(mini)	MathVision(mini)	OlympiadBench	mmmu‑pro
R1V2‑38B	✅	78.9	63.6	73.2	82.9	66.3	61.6	73.6	74.0	49.0	62.6	52.0
R1V1‑38B	✅	72.0	57.2	54.6	72.5	53.5	–	68.0	67.0	–	40.4	–
Deepseek‑R1‑671B	❌	74.3	65.9	71.6	83.3	60.3	71.5	–	–	–	–	–
GPT‑o1	❌	79.8	63.4	72.2	–	–	–	–	–	–	–	–
GPT‑o4‑mini	✅	93.4	74.6	78.1	–	–	49.9	81.6	84.3	58.0	–	–
Claude 3.5 Sonnet	✅	–	–	–	–	–	65.0	66.4	65.3	–	–	–
Kimi k1.5 long-cot	✅	–	–	–	–	–	–	70.0	74.9	–	–	–
Qwen2.5‑VL‑72B‑Instruct	✅	–	–	–	–	–	–	70.2	74.8	–	–	–
InternVL2.5‑78B	✅	–	–	–	–	–	–	70.1	72.3	–	33.2	–

Evaluation Results of State-of-the-Art LLMs and VLMs

📄 License

This project is released under the MIT license.

📚 Citation

If you use Skywork-R1V in your research, please cite:

@misc{chris2025skyworkr1v2multimodalhybrid,
      title={Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning}, 
      author={Chris and Yichen Wei and Yi Peng and Xiaokun Wang and Weijie Qiu and Wei Shen and Tianyidan Xie and Jiangbo Pei and Jianhao Zhang and Yunzhuo Hao and Xuchen Song and Yang Liu and Yahui Zhou},
      year={2025},
      eprint={2504.16656},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.16656}, 
}

@misc{peng2025skyworkr1vpioneeringmultimodal,
      title={Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought}, 
      author={Yi Peng and Chris and Xiaokun Wang and Yichen Wei and Jiangbo Pei and Weijie Qiu and Ai Jian and Yunzhuo Hao and Jiachun Pan and Tianyidan Xie and Li Ge and Rongxian Zhuang and Xuchen Song and Yang Liu and Yahui Zhou},
      year={2025},
      eprint={2504.05599},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.05599}, 
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご