# Dimple-7B
Dimple is the first Discrete Diffusion Multimodal Large Language Model (DMLLM). Architecturally similar to Qwen and LLaVA, it is trained with a hybrid, autoregressive-then-diffusion instruction-tuning strategy:
- Stage 1: Autoregressive fine-tuning for alignment and initial instruction tuning.
- Stage 2: Diffusion-based fine-tuning to enhance instruction-following capabilities.
Trained on the same dataset as LLaVA-NEXT, Dimple-7B outperforms LLaVA-NEXT-7B by 3.9%, indicating that diffusion-based multimodal language models can rival their autoregressive counterparts with a similar training budget.
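The two stages use different objectives. As a rough illustration only (simplified pseudo-losses; the actual masking schedule, loss weighting, and token handling follow the paper, and `response_mask` / `mask_token_id` are placeholder names introduced here):

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(logits, labels):
    # Stage 1 (illustrative): standard next-token prediction.
    return F.cross_entropy(
        logits[:, :-1].flatten(0, 1), labels[:, 1:].flatten(), ignore_index=-100
    )

def diffusion_loss(model, input_ids, response_mask, mask_token_id):
    # Stage 2 (illustrative): mask a random fraction of response tokens and
    # train the model to recover them -- the usual discrete-diffusion objective.
    t = torch.rand(input_ids.size(0), 1, device=input_ids.device)  # per-sample noise level
    masked = (torch.rand(input_ids.shape, device=input_ids.device) < t) & response_mask
    noisy_ids = torch.where(masked, torch.full_like(input_ids, mask_token_id), input_ids)
    logits = model(noisy_ids).logits
    return F.cross_entropy(logits[masked], input_ids[masked])
```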
Model | Demo: Chat with Dimple | Paper | Code
## Features
- Hybrid Training: Integrates autoregressive and diffusion training methods.
- Diffusion Decoding: Supports multiple decoding methods, including confident decoding, random decoding, maskgit-style decoding, and entropy-based decoding (a rough sketch of confident decoding follows this list).
- Controllable Generation: Allows for fine-grained control over format, structure, and length through structure priors.
- Autoregressive-like Prefilling: Improves inference speed using prefilling techniques.
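To make the decoding options above concrete, here is a minimal, hedged sketch of one confident-decoding step (an illustration, not Dimple's exact implementation; in practice the pipeline is selected via the `alg` and `alg_p_threshold` arguments shown in the usage example below):

```python
import torch

def confident_decode_step(logits, token_ids, mask_token_id, threshold=0.95):
    # One parallel refinement step: positions still holding the mask token whose
    # top predicted probability clears the threshold are committed at once; the
    # rest remain masked for later steps. (Illustrative sketch only.)
    probs = logits.softmax(dim=-1)              # [batch, seq_len, vocab]
    confidence, candidates = probs.max(dim=-1)  # best token and its probability per position
    still_masked = token_ids == mask_token_id
    accept = still_masked & (confidence >= threshold)
    return torch.where(accept, candidates, token_ids)
```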
## Evaluation Results
| Benchmark | Dimple-7B (ours) | LLaVA-1.5-7B | LLaVA-NEXT-7B | Eagle-7B | Eagle2-9B | Qwen-VL-7B | Qwen2.5-VL-7B |
|---|---|---|---|---|---|---|---|
| Training Samples | 1.3M | 1.2M | 1.3M | 2.4M | 27.8M | 1.5B | - |
| Training Tokens | 0.8B | - | - | - | - | - | 2.6T |
| Base LLM | Dream (Qwen2.5) | Vicuna | Vicuna-1.5 | Vicuna | Qwen2.5 | Qwen | Qwen2.5 |
| GQA | 59.2 | 62.0 | 64.8 | 64.9 | - | 59.3 | - |
| MMBench (en test) | 74.6 | 64.3 | 68.7 | 68.4 | - | - | 83.5 |
| MME (Perception) | 1514 | 1510 | 1519 | 1528 | - | - | - |
| MME (Cognition) | 432 | - | 332 | - | - | - | - |
| MME (Total) | 1946 | - | 1851 | - | - | - | 2347 |
| POPE | 86.2 | 85.8 | 86.7 | 88.8 | - | - | - |
| MMMU (val) | 45.2 | - | 35.8 | 36.3 | 56.1 | - | 58.6 |
| SQA (img) | 77.1 | 66.8 | 72.8 | 70.0 | - | - | - |
| AI2D | 74.4 | - | 65.4 | - | 83.9 | 62.3 | 83.9 |
| ChartQA | 63.4 | - | 54.9 | 67.7 | 86.4 | 65.7 | 87.3 |
| TextVQA | 61.6 | - | 64.8 | - | 83.0 | - | - |
| OCRBench | 565 | - | 490 | 529 | - | - | - |
| MathVista (mini) | 42.3 | - | 33.0 | - | 63.8 | 37.0 | 68.2 |
| MMVet | 41.2 | 31.1 | 47.3 | - | 62.2 | - | 67.1 |
## Installation
Make sure your environment includes the following versions:
```text
transformers==4.46.2
torch==2.5.1
accelerate==1.6.0
```
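For example, with pip (a minimal sketch; any environment manager that pins these versions works):

```bash
pip install transformers==4.46.2 torch==2.5.1 accelerate==1.6.0
```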
## Usage Examples

### Basic Usage
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model_name = "rp-yu/Dimple-7B"

# Load the processor and model (remote code is required for the Dimple classes).
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Build a single-turn conversation with one image and one text prompt.
image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = [
    [{"role": "user", "content": [
        {"type": "image", "image": image_url},
        {"type": "text", "text": "Describe this image."}
    ]}],
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
)
images = [
    Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
]

inputs = processor(
    text=text,
    images=images,
    videos=None,
    padding="longest",
    return_tensors="pt",
)
input_ids = inputs.pop("input_ids")

# Diffusion-based generation: the response is refined over `steps` denoising steps.
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    output_history=True,
    return_dict_in_generate=True,
    steps=64,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs,
)

# Strip the prompt tokens and decode only the newly generated response.
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]
for j in range(len(messages)):
    print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])
```
## License

This project is licensed under the Apache-2.0 license.
## Citation
```bibtex
@misc{dimple,
      title={Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding},
      author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
      year={2025},
      eprint={2505.16990},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.16990},
}
```