MMR1: Advancing the Frontiers of Multimodal Reasoning
MMR1-Math-v0-7B is a large multimodal model specialized in mathematical reasoning, achieving state-of-the-art performance among open-source 7B models with only 6k training instances.
Quick Start
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model in bfloat16 with FlashAttention 2 and automatic device placement
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "MMR1/MMR1-Math-v0-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("MMR1/MMR1-Math-v0-7B")

# A single-image, single-turn conversation
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat-formatted prompt and collect the vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
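Since MMR1-Math-v0-7B is tuned for mathematical reasoning, a typical prompt pairs a figure with a math question rather than a generic description request. The helper below is a hypothetical post-processing sketch, not part of the repository: it assumes the model writes its final answer in LaTeX \boxed{...} notation and simply pulls out the last such answer from the decoded text.

```python
import re
from typing import Optional

def extract_boxed_answer(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a generated solution, if any.

    Nested braces inside the answer are not handled; this is only an illustrative sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

# Example, reusing the decoded output from above:
# final_answer = extract_boxed_answer(output_text[0])
```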
Advanced Usage
Batch inference
```python
# Batch inference: one multi-image conversation and one text-only conversation
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
messages = [messages1, messages2]

# Apply the chat template to each conversation and batch the inputs together
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate for the whole batch and decode only the new tokens of each sample
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
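Because the base model is Qwen/Qwen2.5-VL-7B-Instruct (see the information table below), the processor should accept the same min_pixels / max_pixels bounds that the Qwen2.5-VL processor uses to cap the visual token budget per image. The snippet below is a sketch under that assumption and can help when large figures exhaust GPU memory.

```python
# Sketch: bound the per-image visual token budget via the (assumed) Qwen2.5-VL
# processor options min_pixels / max_pixels; each visual token covers a 28x28 patch.
from transformers import AutoProcessor

min_pixels = 256 * 28 * 28   # lower bound on image area after resizing
max_pixels = 1280 * 28 * 28  # upper bound on image area after resizing
processor = AutoProcessor.from_pretrained(
    "MMR1/MMR1-Math-v0-7B", min_pixels=min_pixels, max_pixels=max_pixels
)
```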
Features
Key Highlights:
- SOTA Performance: Sets a new state of the art on math-related multimodal benchmarks among open-source 7B models.
- Minimal Training Data: Achieves top-tier performance with just 6k high-quality samples drawn from public training datasets.
- Efficient Training with GRPO: Only 6 hours of RL training on 64 H100 GPUs for 15 epochs (a sketch of the group-relative advantage follows this list).
- Public and High-Quality Data: Publicly sourced datasets, rigorously filtered and balanced across both difficulty and mathematical problem type.
- Balanced Data Strategy: Uniform sampling based on both task difficulty (filtering out overly simple problems) and mathematical reasoning diversity.
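For readers unfamiliar with GRPO, the core idea is to sample a group of candidate solutions per problem, score each with a rule-based reward, and credit every solution relative to its own group's average rather than using a learned critic. The snippet below is a minimal illustrative sketch of that group-relative advantage, not the project's training code, and the binary correctness reward is an assumption.

```python
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize per-solution rewards within one group sampled from the same problem."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # A solution is credited for beating its own group's average, so no value model is needed.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled solutions to one problem, scored 1.0 if the final answer is correct.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```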
Documentation
Model Description
MMR1-Math-v0-7B is a large multimodal model specialized in mathematical tasks. Remarkably, it achieves state-of-the-art performance among open-source 7B multimodal models and competes effectively even against proprietary models with significantly larger parameter counts, all while being trained on only 6k carefully curated data instances.
Evaluation Results
We evaluated our model using VLMEvalKit on four mathematical reasoning benchmarks: MathVista_MINI, MathVision, LogicVista, and MathVerse_MINI.
We also include results on the MathVerse_MINI_Vision_Only_cot (MathVerse_V) subset to maintain consistency with the VLMEvalKit leaderboard. The table below compares our model's performance against various open-source and proprietary models.
Links
Code: https://github.com/LengSicong/MMR1
This model was presented in the paper LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL.
News
- [2025.03.11] Released MMR1-Math-v0, achieving SOTA with only 6k data!
License
This project is licensed under the apache-2.0 license.
Citation
If you find MMR1 useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{MMR1-Math2025,
  title={MMR1: Advancing the Frontiers of Multimodal Reasoning},
  author={Sicong Leng*, Jing Wang*, Jiaxi Li*, Hao Zhang*, Zhiqiang Hu, Boqiang Zhang, Hang Zhang, Yuming Jiang, Xin Li, Fan Wang, Yu Rong, Aixin Sun†, Shijian Lu†},
  year={2025},
  howpublished={\url{https://github.com/LengSicong/MMR1}},
}
```
If you like our project, please give us a star ⭐ on GitHub to support us.
Information Table
| Property | Details |
|----------|---------|
| Base Model | Qwen/Qwen2.5-VL-7B-Instruct |
| Language | en |
| Library Name | transformers |
| License | apache-2.0 |
| Pipeline Tag | image-text-to-text |
| Tags | multi-modal, large-language-model |