Dimple-7B: Open-Source Discrete Diffusion Multimodal Large Language Model

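Dimple 7B

Developed by rp-yu

Dimple is the first discrete diffusion multimodal large language model (DMLLM), trained with a hybrid of autoregressive and diffusion paradigms. Trained on the same dataset as LLaVA-NEXT, it outperforms LLaVA-NEXT-7B by 3.9%.

Image-to-text

Transformers

Language: English | License: Apache-2.0 | Tags: #Diffusion multimodal large language model #Hybrid autoregressive-diffusion training #Image-text generation

Downloads: 422

Released: May 19, 2025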

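Model Overview

Dimple is a multimodal large language model that combines autoregressive and diffusion training paradigms and supports image-text-to-text tasks.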

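Model Features

Hybrid training

Combines autoregressive and diffusion training paradigms to improve model performance.

Diffusion decoding

Supports confident decoding, random decoding, MaskGIT-style decoding, and entropy-based decoding.

Controllable generation

Uses structure priors for fine-grained control over format, structure, and length.

Autoregressive-like prefilling

Uses prefilling to speed up inference.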
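
One way to picture structure priors, offered as an assumption about the mechanism rather than Dimple's actual code: a diffusion model fills in masked positions of a fixed-length canvas instead of appending tokens left to right, so the response can start with some positions already fixed (format tokens) and the rest masked. The sketch below builds such a canvas from a template. The `<mask>` string and the `build_canvas` helper are hypothetical names.

```python
MASK = "<mask>"  # hypothetical mask token string, not Dimple's real token


def build_canvas(template, slot_lengths):
    """Expand a template into a fixed-length canvas.

    `template` is a list whose items are either literal token strings
    (kept fixed) or None (a slot the model is expected to fill).
    `slot_lengths` gives the number of mask tokens for each slot, in order.
    Returns the canvas and a parallel list of "is this position fixed" flags.
    """
    canvas, fixed = [], []
    lengths = iter(slot_lengths)
    for item in template:
        if item is None:
            n = next(lengths)
            canvas.extend([MASK] * n)
            fixed.extend([False] * n)
        else:
            canvas.append(item)
            fixed.append(True)
    return canvas, fixed


# Two reasoning steps with 3 and 2 free tokens respectively.
template = ["Step", "1", ":", None, "Step", "2", ":", None]
canvas, fixed = build_canvas(template, slot_lengths=[3, 2])
print(canvas)
```

The slot lengths bound the response length, and the literal tokens pin its format, which is how a template of this kind would give control over format, structure, and length together.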

Model Capabilities

Image captioning

Multimodal instruction following

Text generation

Image analysis

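Use Cases

Multimodal interaction

Image captioning

Generates a detailed description of an image.

Produces natural and accurate image descriptions.

Visual question answering

Answers questions about image content.

Provides accurate, context-relevant answers.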

🚀 Dimple-7B

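Dimple-7B is the first discrete diffusion multimodal large language model (DMLLM). It uses a hybrid training paradigm that combines autoregressive and diffusion-based instruction tuning. The architecture is similar to Qwen and LLaVA, and training follows an Autoregressive-then-Diffusion strategy:

Stage 1: Autoregressive fine-tuning for alignment and initial instruction tuning.
Stage 2: Diffusion-based fine-tuning to strengthen instruction following.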
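
The two stages differ mainly in how training targets are formed. The toy sketch below contrasts them on a list of token ids: next-token targets for the autoregressive stage, and a randomly masked sequence whose masked positions are the targets for the diffusion stage. It is an illustration under assumptions, not the authors' training code: `MASK_ID`, the helper names, and the fixed masking ratio are all hypothetical.

```python
import random

MASK_ID = -1  # hypothetical id standing in for the mask token


def autoregressive_pairs(tokens):
    """Stage 1 style: the label at position i is the token at i + 1."""
    return tokens[:-1], tokens[1:]


def diffusion_corrupt(tokens, mask_ratio, rng):
    """Stage 2 style: mask a random subset of positions.

    The model would be trained to recover the original token at every
    masked position, seeing the unmasked tokens on both sides.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = MASK_ID
    targets = {i: tokens[i] for i in masked_positions}
    return corrupted, targets


response = [11, 12, 13, 14, 15, 16]  # toy token ids for an answer
inputs, labels = autoregressive_pairs(response)
corrupted, targets = diffusion_corrupt(response, mask_ratio=0.5, rng=random.Random(0))
print(inputs, labels)
print(corrupted, targets)
```

Here `response` stands for the answer portion only; the sketch does not model the prompt or image tokens.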

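Trained on the same dataset as LLaVA-NEXT, Dimple-7B outperforms LLaVA-NEXT-7B by 3.9%, which suggests that diffusion-based multimodal language models can match autoregressive models under a similar training budget.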

Model Information

Property	Details
Base model	Dream-org/Dream-v0-Instruct-7B
Datasets	liuhaotian/LLaVA-CC3M-Pretrain-595K, lmms-lab/LLaVA-NeXT-Data
Language	en
Library	transformers
License	apache-2.0
Metric	accuracy
Task type	image-text-to-text
Tags	Diffusion_Multimodal_Large_Language_Model, MLLM, Discrete_Diffusion

✨ Model | 🎉 Demo: Chat with Dimple | 📄 Paper | 💻 Code

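✨ Key Features

Hybrid training: combines autoregressive and diffusion training.
Diffusion decoding: supports confident decoding, random decoding, MaskGIT-style decoding, and entropy-based decoding.
Controllable generation: uses structure priors for fine-grained control over format, structure, and length.
Autoregressive-like prefilling: uses prefilling to speed up inference.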
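
To make the decoding options more concrete, the sketch below shows one step of threshold-based parallel decoding on toy data: every masked position whose top probability reaches a threshold is committed at once, and if none qualifies, the single most confident position is committed so that decoding always progresses. An entropy score is included as an alternative ranking. This is a simplified reading of confident and entropy-based decoding, not Dimple's implementation, and all function names are hypothetical.

```python
import math


def top_prob(dist):
    """Confidence score: the largest probability in the distribution."""
    return max(dist.values())


def neg_entropy(dist):
    """Entropy-based score: values closer to 0 mean a more certain prediction."""
    return sum(p * math.log(p) for p in dist.values() if p > 0)


def confident_step(predictions, threshold):
    """Pick the masked positions to commit in this step.

    `predictions` maps each still-masked position to a {token: prob} dict.
    Returns {position: chosen_token}.
    """
    chosen = {
        pos: max(dist, key=dist.get)
        for pos, dist in predictions.items()
        if top_prob(dist) >= threshold
    }
    if not chosen:  # guarantee progress: commit the most confident position
        pos = max(predictions, key=lambda p: top_prob(predictions[p]))
        chosen[pos] = max(predictions[pos], key=predictions[pos].get)
    return chosen


predictions = {
    3: {"a": 0.97, "the": 0.03},
    4: {"dog": 0.60, "cat": 0.40},
    5: {"sits": 0.96, "runs": 0.04},
}
print(confident_step(predictions, threshold=0.95))  # commits positions 3 and 5
print(confident_step(predictions, threshold=0.99))  # falls back to position 3 only
```

With a rule like this, the number of tokens committed per step adapts to how certain the model is, rather than being fixed in advance.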

📊 Evaluation Results

Benchmark	Dimple-7B (ours)	LLaVA-1.5-7B	LLaVA-NEXT-7B	Eagle-7B	Eagle2-9B	Qwen-VL-7B	Qwen2.5-VL-7B
Training samples	1.3M	1.2M	1.3M	2.4M	27.8M	1.5B	-
Training tokens	0.8B	-	-	-	-	-	2.6T
Base LLM	Dream (Qwen2.5)	Vicuna	Vicuna-1.5	Vicuna	Qwen2.5	Qwen	Qwen2.5
GQA	59.2	62.0	64.8	64.9	-	59.3	-
MMBench (en test)	74.6	64.3	68.7	68.4	-	-	83.5
MME (Perception)	1514	1510	1519	1528	-	-	-
MME (Cognition)	432	-	332	-	-	-	-
MME (Total)	1946	-	1851	-	-	-	2347
POPE	86.2	85.8	86.7	88.8	-	-	-
MMMU (val)	45.2	-	35.8	36.3	56.1	-	58.6
SQA (img)	77.1	66.8	72.8	70.0	-	-	-
AI2D	74.4	-	65.4	-	83.9	62.3	83.9
ChartQA	63.4	-	54.9	67.7	86.4	65.7	87.3
TextVQA	61.6	-	64.8	-	83.0	-	-
OCRBench	565	-	490	529	-	-	-
MathVista (mini)	42.3	-	33.0	-	63.8	37.0	68.2
MMVet	41.2	31.1	47.3	-	62.2	-	67.1

📦 Installation

Make sure your environment includes the following versions:

transformers==4.46.2
torch==2.5.1
accelerate==1.6.0
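
A quick way to confirm that an environment matches these pins is to compare them against the installed package metadata. The snippet below is a minimal sketch using the standard library's `importlib.metadata`; the `find_mismatches` helper is a hypothetical name, and ignoring local build suffixes such as `+cu121` is a choice made here, not a requirement stated by the model card.

```python
from importlib import metadata

PINNED = {
    "transformers": "4.46.2",
    "torch": "2.5.1",
    "accelerate": "1.6.0",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for packages that differ.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build suffixes such as "+cu121" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

No output means all three packages match the listed versions.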

💻 Usage Example

Basic usage

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

model_name = "rp-yu/Dimple-7B"
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = [
    [{"role": "user", "content": [
        {"type": "image", "image": image_url},
        {"type": "text", "text": "Describe this image."}
    ]}],
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
)
images = [
    Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
]

inputs = processor(
    text=text,
    images=images,
    videos=None,
    padding="longest",
    return_tensors="pt",
)

# input_ids is passed positionally; the remaining processor outputs are forwarded via **inputs.
input_ids = inputs.pop("input_ids")
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    output_history=True,
    return_dict_in_generate=True,
    steps=64,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs
)

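# Drop the prompt tokens from each sequence, then decode the generated part.
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]

# Keep only the text before the first end-of-sequence token.
for j in range(len(messages)):
    print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])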

# output: 0 In the image, a woman wearing a shirt with a plaid and a dog are sitting together on a beach. The sun appears to be setting in the background, creating a warm and serene atmosphere.
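
Since `messages` in the example is a list of conversations, the same call presumably handles several image-prompt pairs at once. The helper below is a hypothetical convenience, not part of the model's API: it builds that nested structure for a batch, mirroring the layout used above. The images would still need to be loaded into the `images` list in the same order, and the URLs shown are placeholders.

```python
def build_messages(image_urls, prompts):
    """Build one single-turn conversation per (image, prompt) pair,
    mirroring the `messages` layout of the basic usage example."""
    if len(image_urls) != len(prompts):
        raise ValueError("image_urls and prompts must have the same length")
    return [
        [{"role": "user", "content": [
            {"type": "image", "image": url},
            {"type": "text", "text": prompt},
        ]}]
        for url, prompt in zip(image_urls, prompts)
    ]


batch = build_messages(
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],  # placeholder URLs
    ["Describe this image.", "What is in the background?"],
)
print(len(batch), batch[1][0]["content"][1]["text"])
```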

📄 License

This project is released under the apache-2.0 license.

📚 Citation

@misc{dimple,
      title={Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding}, 
      author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
      year={2025},
      eprint={2505.16990},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.16990}, 
}