🚀 llama-3.1-8B-vision-378
This project is a projection module trained with SigLIP that gives Llama 3 vision capabilities, subsequently applied to Llama-3.1-8B-Instruct. Built by @yeswondwerr and @qtnx_.
📄 License
This project is released under the llama3.1 license.
📦 Datasets
- liuhaotian/LLaVA-CC3M-Pretrain-595K
🚀 Quick Start
The pipeline tag for this model is image-text-to-text.
💻 Usage Examples
Basic usage
```python
import requests
import torch
from io import BytesIO
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the demo image
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# Load the model in float16 on the GPU; trust_remote_code is required
# because the vision projector is defined in the repository's own code
model = AutoModelForCausalLM.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    use_fast=True,
)

print(
    model.answer_question(
        image, "Briefly describe the image", tokenizer, max_new_tokens=128, do_sample=True, temperature=0.3
    ),
)
```
Advanced usage (4-bit quantization)
```python
import requests
import torch
from io import BytesIO
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Download the demo image
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# Quantize the language model to 4 bits, while keeping the projector
# and vision tower unquantized to preserve visual quality
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["mm_projector", "vision_model"],
)

model = AutoModelForCausalLM.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_cfg,
)

tokenizer = AutoTokenizer.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    use_fast=True,
)

print(
    model.answer_question(
        image, "Briefly describe the image", tokenizer, max_new_tokens=128, do_sample=True, temperature=0.3
    ),
)
```
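As a rough sanity check on why the 4-bit configuration above is attractive on consumer GPUs, here is a back-of-the-envelope estimate of weight memory for an 8-billion-parameter language model. These figures are illustrative lower bounds only (not from the model card): real usage also needs room for activations, the KV cache, quantization metadata, and the unquantized vision tower and projector.

```python
# Back-of-the-envelope weight-memory estimate for an 8B-parameter model.
params = 8e9

fp16_gib = params * 2 / 2**30    # 2 bytes per weight in float16
int4_gib = params * 0.5 / 2**30  # 0.5 bytes per weight in 4-bit

print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
# → fp16: 14.9 GiB, 4-bit: 3.7 GiB
```

This is why the float16 path needs a ~16 GB+ card while the 4-bit path can fit on much smaller GPUs.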