Dimple-7B: an open-source multimodal large language model that outperforms LLaVA-NEXT-7B on the same training data

Dimple 7B

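Developed by rp-yu

Dimple is the first discrete diffusion multimodal large language model (DMLLM) to combine autoregressive and diffusion training paradigms. Trained on the same dataset as LLaVA-NEXT, it outperforms LLaVA-NEXT-7B by 3.9%.

Image-to-Text

Transformers

English | License: Apache-2.0 | #Diffusion multimodal large language model #Hybrid autoregressive-diffusion training #Image-text generation

Downloads: 422

Released: 5/19/2025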

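Model Overview

Dimple is a multimodal large language model that combines autoregressive and diffusion training paradigms and supports image-text-to-text tasks.

Model Features

Hybrid training

Combines autoregressive and diffusion training paradigms to improve model performance.

Diffusion decoding

Supports confident decoding, random decoding, MaskGIT-style decoding, and entropy-based decoding.

Controllable generation

Structure priors give fine-grained control over the format, structure, and length of the output.

Autoregressive-like prefilling

Uses prefilling to speed up inference.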
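
As a rough illustration of the structure-prior idea, the sketch below builds a response canvas in which a few positions are pre-filled with fixed structural tokens and every other position holds a mask placeholder for the diffusion decoder to fill. The function name, the `[MASK]` string, and the JSON-like template are assumptions made for illustration; they are not taken from Dimple's code.

```python
MASK = "[MASK]"  # hypothetical placeholder; the real mask token depends on the tokenizer


def build_canvas(length, fixed):
    """Return a list of `length` tokens: fixed tokens at the given
    positions, MASK everywhere else."""
    canvas = [MASK] * length
    for pos, token in fixed.items():
        if not 0 <= pos < length:
            raise ValueError(f"position {pos} is outside a canvas of length {length}")
        canvas[pos] = token
    return canvas


# Fix the response length to 8 tokens and force a JSON-like skeleton.
canvas = build_canvas(8, {0: "{", 1: '"caption"', 2: ":", 7: "}"})
print(canvas)
```

Choosing the canvas length bounds the response length, and the fixed tokens pin the format; only the masked slots are left for the model to decide.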

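Model Capabilities

Image caption generation

Multimodal instruction following

Text generation

Image analysis

Use Cases

Multimodal interaction

Image captioning

Generates detailed descriptions of images.

Produces natural, accurate captions.

Visual question answering

Answers questions about image content.

Provides accurate, context-relevant answers.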

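🚀 Dimple-7B

Dimple-7B is the first discrete diffusion multimodal large language model (DMLLM). It is trained with a hybrid paradigm that combines autoregressive and diffusion-based instruction tuning. The architecture is similar to Qwen and LLaVA, and training follows an autoregressive-then-diffusion strategy:

Stage 1: Autoregressive fine-tuning for alignment and initial instruction tuning.
Stage 2: Diffusion-based fine-tuning to strengthen instruction following.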
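
To make the difference between the two stages concrete, here is a toy sketch of how training targets might be formed for each objective. It is a simplified illustration with hypothetical helper names and a placeholder mask string, not Dimple's training code: the autoregressive stage predicts each response token from everything before it, while the diffusion stage hides a random subset of response tokens and predicts them from the full, partially masked sequence.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token


def autoregressive_examples(prompt, response):
    """One (context, next_token) pair per response token."""
    return [(prompt + response[:i], response[i]) for i in range(len(response))]


def diffusion_example(prompt, response, mask_ratio, rng):
    """Mask a random subset of response positions; the targets are the
    original tokens at exactly those positions."""
    n_mask = max(1, round(mask_ratio * len(response)))
    masked = set(rng.sample(range(len(response)), n_mask))
    noisy = [MASK if i in masked else tok for i, tok in enumerate(response)]
    targets = {i: response[i] for i in sorted(masked)}
    return prompt + noisy, targets


prompt = ["<image>", "Describe", "this", "image", "."]
response = ["A", "dog", "on", "a", "beach"]

ar = autoregressive_examples(prompt, response)
noisy, targets = diffusion_example(prompt, response, mask_ratio=0.4, rng=random.Random(0))
print(ar[1])           # context ends with "A", target is "dog"
print(noisy, targets)
```

In the autoregressive case the context never includes later tokens, whereas the diffusion example keeps the unmasked response tokens on both sides of each masked slot visible.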

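Trained on the same dataset as LLaVA-NEXT, Dimple-7B outperforms LLaVA-NEXT-7B by 3.9%, which indicates that, under a comparable training budget, diffusion-based multimodal language models can be competitive with autoregressive ones.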

Model Information

Property	Details
Base model	Dream-org/Dream-v0-Instruct-7B
Datasets	liuhaotian/LLaVA-CC3M-Pretrain-595K, lmms-lab/LLaVA-NeXT-Data
Language	en
Library	transformers
License	apache-2.0
Evaluation metric	accuracy
Task type	image-text-to-text
Tags	Diffusion_Multimodal_Large_Language_Model, MLLM, Discrete_Diffusion

✨ Model | 🎉 Demo: Chat with Dimple | 📄 Paper | 💻 Code

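✨ Key Features

Hybrid training: combines autoregressive and diffusion training.
Diffusion decoding: supports confident decoding, random decoding, MaskGIT-style decoding, and entropy-based decoding.
Controllable generation: structure priors give fine-grained control over format, structure, and length.
Autoregressive-like prefilling: uses prefilling to speed up inference.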
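
The decoding strategies listed above differ mainly in how they decide which masked positions to commit at each step. The sketch below is a generic illustration of three of those ranking rules, not Dimple's implementation: given a probability distribution per masked position, it ranks positions by top-1 confidence, by entropy, or at random, and returns the `k` positions to unmask. All function and argument names are hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_positions(dists, k, strategy="confidence", rng=None):
    """Pick k masked positions to unmask in this step.

    dists maps position -> list of token probabilities at that position.
    """
    if strategy == "confidence":
        # Higher top-1 probability ranks first.
        scores = {pos: max(p) for pos, p in dists.items()}
    elif strategy == "entropy":
        # Lower entropy (a more peaked distribution) ranks first.
        scores = {pos: -entropy(p) for pos, p in dists.items()}
    elif strategy == "random":
        rng = rng or random.Random()
        scores = {pos: rng.random() for pos in dists}
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:k])


dists = {
    3: [0.6, 0.4, 0.0, 0.0],  # lower top-1 probability, but only two candidates
    4: [0.7, 0.1, 0.1, 0.1],  # higher top-1 probability, mass spread over the rest
}
by_confidence = select_positions(dists, k=1, strategy="confidence")
by_entropy = select_positions(dists, k=1, strategy="entropy")
print(by_confidence, by_entropy)
```

The two rules can disagree: position 4 has the more probable top token, while position 3 has the lower-entropy distribution, so each rule commits a different position first.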

📊 Evaluation Results

Benchmark	Dimple-7B (this model)	LLaVA-1.5-7B	LLaVA-NEXT-7B	Eagle-7B	Eagle2-9B	Qwen-VL-7B	Qwen2.5-VL-7B
Training samples	1.3M	1.2M	1.3M	2.4M	27.8M	1.5B	-
Training tokens	0.8B	-	-	-	-	-	2.6T
Base LLM	Dream (Qwen2.5)	Vicuna	Vicuna-1.5	Vicuna	Qwen2.5	Qwen	Qwen2.5
GQA	59.2	62.0	64.8	64.9	-	59.3	-
MMBench (en test)	74.6	64.3	68.7	68.4	-	-	83.5
MME (Perception)	1514	1510	1519	1528	-	-	-
MME (Cognition)	432	-	332	-	-	-	-
MME (Total)	1946	-	1851	-	-	-	2347
POPE	86.2	85.8	86.7	88.8	-	-	-
MMMU (val)	45.2	-	35.8	36.3	56.1	-	58.6
SQA (img)	77.1	66.8	72.8	70.0	-	-	-
AI2D	74.4	-	65.4	-	83.9	62.3	83.9
ChartQA	63.4	-	54.9	67.7	86.4	65.7	87.3
TextVQA	61.6	-	64.8	-	83.0	-	-
OCRBench	565	-	490	529	-	-	-
MathVista (mini)	42.3	-	33.0	-	63.8	37.0	68.2
MMVet	41.2	31.1	47.3	-	62.2	-	67.1
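
For a quick read of the table, the snippet below computes the per-benchmark difference between Dimple-7B and LLaVA-NEXT-7B on the ten benchmarks reported on a 0-100 scale. The scores are copied from the table; the script is only a convenience sketch and does not attempt to reproduce the 3.9% headline figure, whose aggregation method is not stated here.

```python
# (Dimple-7B, LLaVA-NEXT-7B) scores copied from the evaluation table.
scores = {
    "GQA": (59.2, 64.8),
    "MMBench (en test)": (74.6, 68.7),
    "POPE": (86.2, 86.7),
    "MMMU (val)": (45.2, 35.8),
    "SQA (img)": (77.1, 72.8),
    "AI2D": (74.4, 65.4),
    "ChartQA": (63.4, 54.9),
    "TextVQA": (61.6, 64.8),
    "MathVista (mini)": (42.3, 33.0),
    "MMVet": (41.2, 47.3),
}

# Positive delta means Dimple-7B scores higher.
deltas = {name: round(dimple - llava, 1) for name, (dimple, llava) in scores.items()}
wins = [name for name, d in deltas.items() if d > 0]

for name, d in deltas.items():
    print(f"{name:<18} {d:+.1f}")
print(f"Dimple-7B is ahead on {len(wins)} of {len(deltas)} benchmarks")
```

The MME and OCRBench rows are left out because they use a different score scale.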

📦 Installation

Make sure your environment includes the following versions:

transformers==4.46.2
torch==2.5.1
accelerate==1.6.0
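
One way to confirm these pins at runtime is to compare the installed package versions against them with the standard-library `importlib.metadata` module. The helper below is a small sketch (the function name is hypothetical); it reports mismatches instead of raising, so it can be run before loading the model.

```python
from importlib import metadata

REQUIRED = {
    "transformers": "4.46.2",
    "torch": "2.5.1",
    "accelerate": "1.6.0",
}


def find_mismatches(required, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not met.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for package, expected in required.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "2.5.1+cu121" still count as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(REQUIRED).items():
        print(f"{package}: expected {expected}, found {found}")
```

An empty result means all three pins are satisfied.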

💻 Usage Example

Basic usage

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

model_name = "rp-yu/Dimple-7B"
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = [
    [{"role": "user", "content": [
        {"type": "image", "image": image_url},
        {"type": "text", "text": "Describe this image."}
    ]}],
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
)
images = [
    Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
]

inputs = processor(
    text=text,
    images=images,
    videos=None,
    padding="longest",
    return_tensors="pt",
)

# input_ids is passed positionally; the remaining processor outputs are forwarded via **inputs.
input_ids = inputs.pop("input_ids")
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    output_history=True,
    return_dict_in_generate=True,
    steps=64,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs
)

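# Drop the prompt tokens from each sequence and decode only the newly generated part.
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]

# Print each response, cut at the first end-of-sequence token.
for j in range(len(messages)):
    print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])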

# output: 0 In the image, a woman wearing a shirt with a plaid and a dog are sitting together on a beach. The sun appears to be setting in the background, creating a warm and serene atmosphere.
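
The `messages` list in the example is a batch: each element is one conversation. To run several image and prompt pairs in a single call, the same structure can be assembled with a small helper like the one below. The helper name is hypothetical and only mirrors the message layout used in the example; it is not part of the Dimple or Transformers API.

```python
def build_messages(pairs):
    """Build a batch of single-turn conversations.

    pairs is an iterable of (image_url, prompt) tuples.
    """
    return [
        [{"role": "user", "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": prompt},
        ]}]
        for image_url, prompt in pairs
    ]


demo_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = build_messages([
    (demo_url, "Describe this image."),
    (demo_url, "What animal is in the picture?"),
])
print(len(messages))
```

With a batch like this, the `images` list passed to the processor would presumably need one loaded image per conversation, in the same order.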

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

@misc{dimple,
      title={Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding}, 
      author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
      year={2025},
      eprint={2505.16990},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.16990}, 
}