llava-llama-3-8b-v1_1-transformers open-source model: free deployment for image-text-to-text tasks

Llava Llama 3 8b V1 1 Transformers

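Developed by xtuner

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-text-to-text tasks

Image-to-text

Safetensors

#Multimodal dialogue #High-resolution image understanding #LoRA fine-tuning

Downloads: 454.61k

Released: 4/26/2024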

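Model Overview

This is a multimodal model that takes an image together with a text prompt and generates a description of the image or an answer to a question about it.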

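Model Features

Multimodal understanding

Combines a vision encoder with a language model to interpret image content and generate related text

High performance

Outperforms LLaVA-v1.5-7B on most of the reported benchmarks (for example, 72.3 vs. 66.5 on the English MMBench test)

LoRA fine-tuning

Fine-tunes the vision encoder with LoRA during the fine-tuning stage to improve model performance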

Model Capabilities

Image content understanding

Image question answering

Multimodal dialogue

Visual reasoning

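Use Cases

Visual question answering

Image description

Describes image content in detail

Identifies objects, scenes, and relationships in the image

Visual reasoning

Answers reasoning questions about an image

Scores 72.3 on the English MMBench test and 66.4 on the Chinese MMBench test

Education

Science question answering

Answers science questions based on images

Scores 72.9 on the ScienceQA test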

🚀 Multimodal large model llava-llama-3-8b-v1_1-hf

llava-llama-3-8b-v1_1-hf is an image-text multimodal large model. It was obtained with the XTuner toolkit by fine-tuning meta-llama/Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 on the ShareGPT4V-PT and InternVL-SFT datasets.

🚀 Quick Start

Chat via `pipeline`

from transformers import pipeline
from PIL import Image
import requests

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"
# device=0 places the pipeline on the first GPU.
pipe = pipeline("image-to-text", model=model_id, device=0)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Download the sample image and open it with PIL.
image = Image.open(requests.get(url, stream=True).raw)
# Llama 3 chat format; <image> marks where the image features are inserted.
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat are these?<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
# Output:
# [{'generated_text': 'user\n\n\nWhat are these?assistant\n\nThese are two cats, one brown and one gray, lying on a pink blanket. sleep. brown and gray cat sleeping on a pink blanket.'}]
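
The sample output above shows that `generated_text` echoes the prompt (with the special tokens stripped) before the model's reply. A minimal sketch for keeping only the reply follows. It assumes the echoed text always contains the `assistant\n\n` marker seen in that output; `extract_reply` is a hypothetical helper for illustration, not part of `transformers`.

```python
def extract_reply(generated_text: str, marker: str = "assistant\n\n") -> str:
    """Return the text after the first assistant marker.

    If the marker is absent, the input is returned unchanged (whitespace-trimmed).
    """
    _, found, reply = generated_text.partition(marker)
    return reply.strip() if found else generated_text.strip()


# String copied from the beginning of the sample pipeline output above.
sample = ("user\n\n\nWhat are these?assistant\n\n"
          "These are two cats, one brown and one gray, lying on a pink blanket.")
print(extract_reply(sample))
```

With the pipeline example, this would be called as `extract_reply(outputs[0]["generated_text"])`.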

Chat via pure `transformers`

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"

# Llama 3 chat format; <image> marks where the image features are inserted.
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat are these?<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Load the weights in half precision and move the model to GPU 0.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(requests.get(image_file, stream=True).raw)
# Pass text and images by keyword so the call does not depend on positional order.
inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

# Greedy decoding (do_sample=False), capped at 200 new tokens.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
# Skip the first two tokens of the returned sequence, then decode without special tokens.
print(processor.decode(output[0][2:], skip_special_tokens=True))
# Output:
# These are two cats, one brown and one gray, lying on a pink blanket. sleep. brown and gray cat sleeping on a pink blanket.
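
Both examples hard-code the same Llama 3 style prompt string. As a small sketch, the string can be assembled from a question so that other questions reuse the identical template. `build_prompt` and `IMAGE_TOKEN` are hypothetical names introduced here; only the template text itself comes from the examples above.

```python
IMAGE_TOKEN = "<image>"  # placeholder used in the prompts above


def build_prompt(question: str) -> str:
    """Wrap a single user question in the prompt template used in the examples."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{IMAGE_TOKEN}\n{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_prompt("What are these?")
print(repr(prompt))
```

The result for `"What are these?"` is character-for-character the `prompt` value used in the examples, so it can be passed to `pipe(...)` or `processor(...)` in its place.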

Reproducing the Experiments

Please refer to the documentation.

✨ Key Features

Multimodal fusion: combines image and text information for richer interaction.
Multiple formats: available in HuggingFace LLaVA format, XTuner LLaVA format, and GGUF format.

📚 Detailed Documentation

Model Information

llava-llama-3-8b-v1_1-hf is a LLaVA model fine-tuned by XTuner from meta-llama/Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 on the ShareGPT4V-PT and InternVL-SFT datasets.

Note: this model is in HuggingFace LLaVA format.

Resource Links

GitHub: xtuner
Official LLaVA format model: xtuner/llava-llama-3-8b-v1_1-hf
XTuner LLaVA format model: xtuner/llava-llama-3-8b-v1_1
GGUF format model: xtuner/llava-llama-3-8b-v1_1-gguf

Model Details

Model	Vision Encoder	Projector	Resolution	Pretraining Strategy	Fine-tuning Strategy	Pretraining Dataset	Fine-tuning Dataset
LLaVA-v1.5-7B	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, Frozen ViT	LLaVA-PT (558K)	LLaVA-Mix (665K)
LLaVA-Llama-3-8B	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, LoRA ViT	LLaVA-PT (558K)	LLaVA-Mix (665K)
LLaVA-Llama-3-8B-v1.1	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, LoRA ViT	ShareGPT4V-PT (1246K)	InternVL-SFT (1268K)
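
The table lists a CLIP-L encoder at resolution 336, and the encoder name CLIP-ViT-Large-patch14-336 indicates 14-pixel patches. As a rough sketch of what that implies for sequence length, the snippet below counts the patches in one image. It assumes, as in the usual LLaVA design but not confirmed by this card, that each patch becomes one visual token after the MLP projector. `num_patch_tokens` is a hypothetical helper.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # patches along one edge
    return side * side


# Resolution 336 with 14-pixel patches: 24 patches per side.
tokens = num_patch_tokens(336, 14)
print(tokens)  # 576
```

Under that assumption, each image would occupy 576 positions of the language model's context in addition to the text tokens.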

Model Results

Model	MMBench Test (EN)	MMBench Test (CN)	CCBench Dev	MMMU Val	SEED-IMG	AI2D Test	ScienceQA Test	HallusionBench Accuracy	POPE	GQA	TextVQA	MME	MMStar
LLaVA-v1.5-7B	66.5	59.0	27.5	35.3	60.5	54.8	70.4	44.9	85.9	62.0	58.2	1511/348	30.3
LLaVA-Llama-3-8B	68.9	61.6	30.4	36.8	69.8	60.9	73.3	47.3	87.2	63.5	58.0	1506/295	38.2
LLaVA-Llama-3-8B-v1.1	72.3	66.4	31.6	36.8	70.1	70.0	72.9	47.7	86.4	62.6	59.0	1469/349	45.1
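
To make the comparison in the table easier to read, the sketch below computes the per-benchmark difference between LLaVA-Llama-3-8B-v1.1 and LLaVA-v1.5-7B. The scores are copied from the table; MME is left out because it is reported as a pair of values (1469/349 vs. 1511/348). The dictionary and variable names are illustrative only.

```python
# Scores copied from the results table (MME omitted: it is a two-part score).
baseline = {  # LLaVA-v1.5-7B
    "MMBench EN": 66.5, "MMBench CN": 59.0, "CCBench": 27.5, "MMMU": 35.3,
    "SEED-IMG": 60.5, "AI2D": 54.8, "ScienceQA": 70.4, "HallusionBench": 44.9,
    "POPE": 85.9, "GQA": 62.0, "TextVQA": 58.2, "MMStar": 30.3,
}
v1_1 = {  # LLaVA-Llama-3-8B-v1.1
    "MMBench EN": 72.3, "MMBench CN": 66.4, "CCBench": 31.6, "MMMU": 36.8,
    "SEED-IMG": 70.1, "AI2D": 70.0, "ScienceQA": 72.9, "HallusionBench": 47.7,
    "POPE": 86.4, "GQA": 62.6, "TextVQA": 59.0, "MMStar": 45.1,
}

# Positive values mean v1.1 scores higher than the baseline.
deltas = {name: round(v1_1[name] - baseline[name], 1) for name in baseline}
largest = max(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name:15s} {delta:+.1f}")
print("largest gain:", largest)
```

On these twelve benchmarks every difference is positive, with the largest gaps on AI2D and MMStar. On MME the first component is lower for v1.1 (1469 vs. 1511), which is why the comparison is best read as "most benchmarks" rather than "all".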

📄 Citation

@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished = {\url{https://github.com/InternLM/xtuner}},
    year={2023}
}