llava-llama-3-8b-v1_1-transformers open-source model: free deployment for image-text-to-text tasks

Llava Llama 3 8b V1 1 Transformers

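Developed by xtuner

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336; supports image-text-to-text tasks

Image-to-text

Safetensors

#MultimodalDialogue #HighResolutionImageUnderstanding #LoRAFineTuning

Downloads: 454.61k

Release date: 4/26/2024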

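Model overview

This is a multimodal model that understands image content and generates text describing the image or answering questions about it.

Model features

Multimodal understanding

Combines a vision encoder with a language model to interpret image content and generate related text

High performance

Scores higher than LLaVA-v1.5-7B on every single-score benchmark listed in the results table (for example, 72.3 vs. 66.5 on MMBench Test (EN)); the two-part MME score is mixed (1469/349 vs. 1511/348)

LoRA fine-tuning

The vision encoder (ViT) is fine-tuned with LoRA during the fine-tuning stage, while the LLM is fully fine-tuned

Model capabilities

Image content understanding

Image question answering

Multimodal dialogue

Visual reasoning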

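Use cases

Visual question answering

Image content description

Describes image content in detail

Identifies the objects, scenes, and relationships in an image

Visual reasoning

Answers reasoning questions about an image

Scores 72.3 (English) and 66.4 (Chinese) on the MMBench test set

Education

Science question answering

Answers science questions based on images

Scores 72.9 on the ScienceQA test set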

🚀 Multimodal large model llava-llama-3-8b-v1_1-hf

llava-llama-3-8b-v1_1-hf is an image-text multimodal large model. It is obtained by fine-tuning meta-llama/Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 on the ShareGPT4V-PT and InternVL-SFT datasets with the XTuner toolkit.

🚀 Quick start

Chat via `pipeline`

from transformers import pipeline
from PIL import Image
import requests

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"
# device=0 places the model on the first GPU
pipe = pipeline("image-to-text", model=model_id, device=0)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"

image = Image.open(requests.get(url, stream=True).raw)
# Llama-3 chat format; <image> marks where the image features are inserted
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat are these?<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
# Example output (the prompt text is echoed before the answer):
# [{'generated_text': 'user\n\n\nWhat are these?assistant\n\nThese are two cats, one brown and one gray, lying on a pink blanket. sleep. brown and gray cat sleeping on a pink blanket.'}]
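As the example output shows, the pipeline returns the echoed prompt followed by the answer. A minimal sketch for keeping only the answer is given below. The helper name `extract_reply` and the choice of `"assistant\n\n"` as the split marker are assumptions made for illustration, based only on the single-turn output above; they are not part of the model or of `transformers`.

```python
def extract_reply(generated_text: str, marker: str = "assistant\n\n") -> str:
    """Return the text after the first assistant marker.

    Hypothetical helper: assumes a single-turn prompt, so the first
    occurrence of the marker is the end of the echoed prompt.
    Falls back to the whole text if the marker is absent.
    """
    _, sep, reply = generated_text.partition(marker)
    return reply.strip() if sep else generated_text.strip()


# Sample taken from the example output above.
sample = [{'generated_text': 'user\n\n\nWhat are these?assistant\n\n'
           'These are two cats, one brown and one gray, lying on a pink blanket. '
           'sleep. brown and gray cat sleeping on a pink blanket.'}]
print(extract_reply(sample[0]["generated_text"]))
```

For multi-turn prompts the marker appears more than once, so a different splitting rule would be needed.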

Chat via pure `transformers`

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"

# Llama-3 chat format; <image> marks where the image features are inserted
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat are these?<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Load the weights in half precision and move the model to GPU 0
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(requests.get(image_file, stream=True).raw)
# Keyword arguments avoid depending on the processor's positional order
inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs["input_ids"].shape[1]
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))
# Example output:
# These are two cats, one brown and one gray, lying on a pink blanket. sleep. brown and gray cat sleeping on a pink blanket.
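Both examples above write the Llama-3 chat prompt by hand. The sketch below wraps that string in a small function so the question can be changed without retyping the special tokens. `build_prompt` and its `with_image` parameter are hypothetical names introduced here; the token layout is copied from the prompts above and covers a single user turn only.

```python
def build_prompt(question: str, with_image: bool = True) -> str:
    """Build a single-turn prompt in the format used by the examples above.

    Hypothetical helper, not part of transformers or XTuner.
    """
    # <image> is the placeholder that the processor replaces with image features.
    image_tag = "<image>\n" if with_image else ""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{image_tag}{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_prompt("What are these?")
print(repr(prompt))
```

The returned string can be passed as `prompt` to either example in place of the hand-written one.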

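Reproducing the experiments

Please refer to the documentation.

✨ Key features

Multimodal fusion: combines image and text information for richer interaction.
Multiple formats: available in HuggingFace LLaVA format, XTuner LLaVA format, and GGUF format.

📚 Detailed documentation

Model information

llava-llama-3-8b-v1_1-hf is a LLaVA model fine-tuned by XTuner from meta-llama/Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 on the ShareGPT4V-PT and InternVL-SFT datasets.

Note: this model is in HuggingFace LLaVA format.

Resource links

GitHub: xtuner
Official LLaVA format model: xtuner/llava-llama-3-8b-v1_1-hf
XTuner LLaVA format model: xtuner/llava-llama-3-8b-v1_1
GGUF format model: xtuner/llava-llama-3-8b-v1_1-gguf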
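Because the same weights are published under several repository ids, a small lookup can help avoid loading the wrong format. This is only a sketch: the dictionary keys (`"huggingface"`, `"official"`, `"xtuner"`, `"gguf"`) and the function `repo_for` are labels invented here, while the repository ids come from the list above and from the quick-start code.

```python
# Repository ids copied from the resource list and the quick-start code.
# The short keys on the left are hypothetical labels chosen for this sketch.
REPOS = {
    "huggingface": "xtuner/llava-llama-3-8b-v1_1-transformers",
    "official": "xtuner/llava-llama-3-8b-v1_1-hf",
    "xtuner": "xtuner/llava-llama-3-8b-v1_1",
    "gguf": "xtuner/llava-llama-3-8b-v1_1-gguf",
}


def repo_for(fmt: str) -> str:
    """Return the repository id for a format label, ignoring case and spaces."""
    key = fmt.strip().lower()
    if key not in REPOS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(REPOS)}")
    return REPOS[key]


print(repo_for("GGUF"))
```

Only the `"huggingface"` entry matches the `LlavaForConditionalGeneration` examples in the quick start; the other formats need their own tooling.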

Model details

Model	Visual Encoder	Projector	Resolution	Pretraining Strategy	Fine-tuning Strategy	Pretraining Dataset	Fine-tuning Dataset
LLaVA-v1.5-7B	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, Frozen ViT	LLaVA-PT (558K)	LLaVA-Mix (665K)
LLaVA-Llama-3-8B	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, LoRA ViT	LLaVA-PT (558K)	LLaVA-Mix (665K)
LLaVA-Llama-3-8B-v1.1	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, LoRA ViT	ShareGPT4V-PT (1246K)	InternVL-SFT (1268K)

Model results

Model	MMBench Test (EN)	MMBench Test (CN)	CCBench Dev	MMMU Val	SEED-IMG	AI2D Test	ScienceQA Test	HallusionBench Acc	POPE	GQA	TextVQA	MME	MMStar
LLaVA-v1.5-7B	66.5	59.0	27.5	35.3	60.5	54.8	70.4	44.9	85.9	62.0	58.2	1511/348	30.3
LLaVA-Llama-3-8B	68.9	61.6	30.4	36.8	69.8	60.9	73.3	47.3	87.2	63.5	58.0	1506/295	38.2
LLaVA-Llama-3-8B-v1.1	72.3	66.4	31.6	36.8	70.1	70.0	72.9	47.7	86.4	62.6	59.0	1469/349	45.1
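To make the comparison in the table easier to read, the sketch below computes the per-benchmark difference between LLaVA-Llama-3-8B-v1.1 and LLaVA-v1.5-7B. All scores are copied from the table; MME is left out because it reports two numbers per model. The variable names are illustrative only.

```python
# Scores copied from the results table (MME omitted: it has two values per model).
BENCHMARKS = [
    "MMBench EN", "MMBench CN", "CCBench Dev", "MMMU Val", "SEED-IMG", "AI2D Test",
    "ScienceQA Test", "HallusionBench", "POPE", "GQA", "TextVQA", "MMStar",
]
LLAVA_V15_7B = [66.5, 59.0, 27.5, 35.3, 60.5, 54.8, 70.4, 44.9, 85.9, 62.0, 58.2, 30.3]
LLAVA_L3_V11 = [72.3, 66.4, 31.6, 36.8, 70.1, 70.0, 72.9, 47.7, 86.4, 62.6, 59.0, 45.1]

# Difference per benchmark, rounded to one decimal to hide float noise.
deltas = {
    name: round(new - old, 1)
    for name, old, new in zip(BENCHMARKS, LLAVA_V15_7B, LLAVA_L3_V11)
}
wins = sum(d > 0 for d in deltas.values())
largest = max(deltas, key=deltas.get)

for name, d in deltas.items():
    print(f"{name:16s} {d:+.1f}")
print(f"v1.1 is higher on {wins} of {len(deltas)} benchmarks; largest gain: {largest}")
```

On these numbers v1.1 is ahead on all twelve single-score benchmarks, with the largest gains on AI2D and MMStar.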

📄 Citation

@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished = {\url{https://github.com/InternLM/xtuner}},
    year={2023}
}