MMR1: Advancing the Frontiers of Multimodal Reasoning
MMR1-Math-v0-7B is a large multimodal model specialized in mathematical reasoning, achieving state-of-the-art performance among open-source 7B models with only 6k training instances.
Quick Start
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model in bfloat16 with FlashAttention 2 and automatic device placement
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "MMR1/MMR1-Math-v0-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("MMR1/MMR1-Math-v0-7B")

# A single-image, single-turn conversation
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat-formatted prompt and collect the vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
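Since MMR1-Math-v0-7B is tuned for mathematical reasoning, a typical prompt pairs a figure with a math question rather than a generic description request. The helper below is a hypothetical post-processing sketch, not part of the repository: it assumes the model writes its final answer in LaTeX \boxed{...} notation and simply pulls out the last such answer from the decoded text.

```python
import re
from typing import Optional

def extract_boxed_answer(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a generated solution, if any.

    Nested braces inside the answer are not handled; this is only an illustrative sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

# Example, reusing the decoded output from above:
# final_answer = extract_boxed_answer(output_text[0])
```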
Advanced Usage
Batch inference
```python
# Batch inference: one multi-image conversation and one text-only conversation
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
messages = [messages1, messages2]

# Apply the chat template to each conversation and batch the inputs together
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate for the whole batch and decode only the new tokens of each sample
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
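Because the base model is Qwen/Qwen2.5-VL-7B-Instruct (see the information table below), the processor should accept the same min_pixels / max_pixels bounds that the Qwen2.5-VL processor uses to cap the visual token budget per image. The snippet below is a sketch under that assumption and can help when large figures exhaust GPU memory.

```python
# Sketch: bound the per-image visual token budget via the (assumed) Qwen2.5-VL
# processor options min_pixels / max_pixels; each visual token covers a 28x28 patch.
from transformers import AutoProcessor

min_pixels = 256 * 28 * 28   # lower bound on image area after resizing
max_pixels = 1280 * 28 * 28  # upper bound on image area after resizing
processor = AutoProcessor.from_pretrained(
    "MMR1/MMR1-Math-v0-7B", min_pixels=min_pixels, max_pixels=max_pixels
)
```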
Features
Key Highlights:
- SOTA Performance: Sets a new state of the art on math-related multimodal benchmarks among open-source 7B models.
- Minimal Training Data: Achieves top-tier performance with just 6k high-quality samples drawn from public training datasets.
- Efficient Training with GRPO: Only 6 hours of RL training on 64 H100 GPUs for 15 epochs (a sketch of the group-relative advantage follows this list).
- Public and High-Quality Data: Publicly sourced datasets, rigorously filtered and balanced across both difficulty and mathematical problem type.
- Balanced Data Strategy: Uniform sampling based on both task difficulty (filtering out overly simple problems) and mathematical reasoning diversity.
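For readers unfamiliar with GRPO, the core idea is to sample a group of candidate solutions per problem, score each with a rule-based reward, and credit every solution relative to its own group's average rather than using a learned critic. The snippet below is a minimal illustrative sketch of that group-relative advantage, not the project's training code, and the binary correctness reward is an assumption.

```python
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize per-solution rewards within one group sampled from the same problem."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # A solution is credited for beating its own group's average, so no value model is needed.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled solutions to one problem, scored 1.0 if the final answer is correct.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```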
Documentation
Model Description
MMR1-Math-v0-7B is a large multimodal model specialized in mathematical tasks. Remarkably, it achieves state-of-the-art performance among open-source 7B multimodal models and competes effectively even against proprietary models with significantly larger parameter counts, all while being trained on only 6k carefully curated data instances.
Evaluation Results
We evaluated our model using VLMEvalKit on four mathematical reasoning benchmarks: MathVista_MINI, MathVision, LogicVista, and MathVerse_MINI.
We also include results on the MathVerse_MINI_Vision_Only_cot (MathVerse_V) subset to maintain consistency with the VLMEvalKit leaderboard. The table below compares our model's performance against various open-source and proprietary models.
Links
Code: https://github.com/LengSicong/MMR1
This model was presented in the paper LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL.
News
- [2025.03.11] Released MMR1-Math-v0, achieving SOTA with only 6k data!
License
This project is licensed under the apache-2.0 license.
Citation
If you find MMR1 useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{MMR1-Math2025,
  title={MMR1: Advancing the Frontiers of Multimodal Reasoning},
  author={Sicong Leng*, Jing Wang*, Jiaxi Li*, Hao Zhang*, Zhiqiang Hu, Boqiang Zhang, Hang Zhang, Yuming Jiang, Xin Li, Fan Wang, Yu Rong, Aixin Sun†, Shijian Lu†},
  year={2025},
  howpublished={\url{https://github.com/LengSicong/MMR1}},
}
```
If you like our project, please give us a star ⭐ on GitHub to support us.
Information Table
| Property | Details |
|----------|---------|
| Base Model | Qwen/Qwen2.5-VL-7B-Instruct |
| Language | en |
| Library Name | transformers |
| License | apache-2.0 |
| Pipeline Tag | image-text-to-text |
| Tags | multi-modal, large-language-model |