# Dimple-7B
Dimple is the first Discrete Diffusion Multimodal Large Language Model (DMLLM). Architecturally similar to Qwen and LLaVA, it is trained with a hybrid, autoregressive-then-diffusion instruction-tuning strategy:
- Stage 1: Autoregressive fine-tuning for alignment and initial instruction tuning.
- Stage 2: Diffusion-based fine-tuning to enhance instruction-following capabilities.
Trained on the same dataset as LLaVA-NEXT, Dimple-7B outperforms LLaVA-NEXT-7B by 3.9%, indicating that diffusion-based multimodal language models can rival their autoregressive counterparts with a similar training budget.
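The two stages use different objectives. As a rough illustration only (simplified pseudo-losses; the actual masking schedule, loss weighting, and token handling follow the paper, and `response_mask` / `mask_token_id` are placeholder names introduced here):

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(logits, labels):
    # Stage 1 (illustrative): standard next-token prediction.
    return F.cross_entropy(
        logits[:, :-1].flatten(0, 1), labels[:, 1:].flatten(), ignore_index=-100
    )

def diffusion_loss(model, input_ids, response_mask, mask_token_id):
    # Stage 2 (illustrative): mask a random fraction of response tokens and
    # train the model to recover them -- the usual discrete-diffusion objective.
    t = torch.rand(input_ids.size(0), 1, device=input_ids.device)  # per-sample noise level
    masked = (torch.rand(input_ids.shape, device=input_ids.device) < t) & response_mask
    noisy_ids = torch.where(masked, torch.full_like(input_ids, mask_token_id), input_ids)
    logits = model(noisy_ids).logits
    return F.cross_entropy(logits[masked], input_ids[masked])
```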
Model | Demo: Chat with Dimple | Paper | Code
## Features
- Hybrid Training: Integrates autoregressive and diffusion training methods.
- Diffusion Decoding: Supports multiple decoding methods, including confident decoding, random decoding, maskgit-style decoding, and entropy-based decoding (a rough sketch of confident decoding follows this list).
- Controllable Generation: Allows for fine-grained control over format, structure, and length through structure priors.
- Autoregressive-like Prefilling: Improves inference speed using prefilling techniques.
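To make the decoding options above concrete, here is a minimal, hedged sketch of one confident-decoding step (an illustration, not Dimple's exact implementation; in practice the pipeline is selected via the `alg` and `alg_p_threshold` arguments shown in the usage example below):

```python
import torch

def confident_decode_step(logits, token_ids, mask_token_id, threshold=0.95):
    # One parallel refinement step: positions still holding the mask token whose
    # top predicted probability clears the threshold are committed at once; the
    # rest remain masked for later steps. (Illustrative sketch only.)
    probs = logits.softmax(dim=-1)              # [batch, seq_len, vocab]
    confidence, candidates = probs.max(dim=-1)  # best token and its probability per position
    still_masked = token_ids == mask_token_id
    accept = still_masked & (confidence >= threshold)
    return torch.where(accept, candidates, token_ids)
```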
## Evaluation Results
| Benchmark | Dimple-7B (ours) | LLaVA-1.5-7B | LLaVA-NEXT-7B | Eagle-7B | Eagle2-9B | Qwen-VL-7B | Qwen2.5-VL-7B |
|---|---|---|---|---|---|---|---|
| Training Samples | 1.3M | 1.2M | 1.3M | 2.4M | 27.8M | 1.5B | - |
| Training Tokens | 0.8B | - | - | - | - | - | 2.6T |
| Base LLM | Dream (Qwen2.5) | Vicuna | Vicuna-1.5 | Vicuna | Qwen2.5 | Qwen | Qwen2.5 |
| GQA | 59.2 | 62.0 | 64.8 | 64.9 | - | 59.3 | - |
| MMBench (en test) | 74.6 | 64.3 | 68.7 | 68.4 | - | - | 83.5 |
| MME (Perception) | 1514 | 1510 | 1519 | 1528 | - | - | - |
| MME (Cognition) | 432 | - | 332 | - | - | - | - |
| MME (Total) | 1946 | - | 1851 | - | - | - | 2347 |
| POPE | 86.2 | 85.8 | 86.7 | 88.8 | - | - | - |
| MMMU (val) | 45.2 | - | 35.8 | 36.3 | 56.1 | - | 58.6 |
| SQA (img) | 77.1 | 66.8 | 72.8 | 70.0 | - | - | - |
| AI2D | 74.4 | - | 65.4 | - | 83.9 | 62.3 | 83.9 |
| ChartQA | 63.4 | - | 54.9 | 67.7 | 86.4 | 65.7 | 87.3 |
| TextVQA | 61.6 | - | 64.8 | - | 83.0 | - | - |
| OCRBench | 565 | - | 490 | 529 | - | - | - |
| MathVista (mini) | 42.3 | - | 33.0 | - | 63.8 | 37.0 | 68.2 |
| MMVet | 41.2 | 31.1 | 47.3 | - | 62.2 | - | 67.1 |
## Installation
Make sure your environment includes the following versions:
```text
transformers==4.46.2
torch==2.5.1
accelerate==1.6.0
```
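For example, with pip (a minimal sketch; any environment manager that pins these versions works):

```bash
pip install transformers==4.46.2 torch==2.5.1 accelerate==1.6.0
```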
## Usage Examples

### Basic Usage
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model_name = "rp-yu/Dimple-7B"

# Load the processor and model (remote code is required for the Dimple classes).
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Build a single-turn conversation with one image and one text prompt.
image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = [
    [{"role": "user", "content": [
        {"type": "image", "image": image_url},
        {"type": "text", "text": "Describe this image."}
    ]}],
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
)
images = [
    Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
]

inputs = processor(
    text=text,
    images=images,
    videos=None,
    padding="longest",
    return_tensors="pt",
)
input_ids = inputs.pop("input_ids")

# Diffusion-based generation: the response is refined over `steps` denoising steps.
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    output_history=True,
    return_dict_in_generate=True,
    steps=64,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs,
)

# Strip the prompt tokens and decode only the newly generated response.
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]
for j in range(len(messages)):
    print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])
```
## License

This project is licensed under the Apache-2.0 license.
## Citation
```bibtex
@misc{dimple,
      title={Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding},
      author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
      year={2025},
      eprint={2505.16990},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.16990},
}
```