# 🚀 llama-3.1-8B-vision-378
This project is a projection module, trained with SigLIP, that gives Llama 3 vision capabilities; it is applied here to Llama-3.1-8B-Instruct. Built by @yeswondwerr and @qtnx_.
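For intuition, below is a minimal sketch of what a LLaVA-style projection module can look like. The class name, layer layout, and dimensions are illustrative assumptions, not the repository's actual implementation: 1152 matches SigLIP-SO400M's hidden size and 4096 matches Llama-3.1-8B's hidden size.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Hypothetical LLaVA-style projector: maps vision patch embeddings
    into the language model's embedding space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP, a common choice for LLaVA-style projectors
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_features)
```

The projected patch embeddings are then placed alongside the text token embeddings before being fed to the language model.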
## 📄 License

This project is released under the llama3.1 license.
## 📦 Dataset
- liuhaotian/LLaVA-CC3M-Pretrain-595K
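If you want to inspect the pretraining data locally, here is a minimal sketch using `huggingface_hub.snapshot_download`; this tooling choice is an assumption, as the model card does not prescribe a download method.

```python
from huggingface_hub import snapshot_download

# Download the raw dataset repository files to the local HF cache
local_dir = snapshot_download(
    repo_id="liuhaotian/LLaVA-CC3M-Pretrain-595K",
    repo_type="dataset",
)
print(local_dir)
```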
## 🚀 Quick Start

The model's pipeline tag is `image-text-to-text`. It ships custom code, so it is loaded via `AutoModelForCausalLM` with `trust_remote_code=True`, as shown in the examples below.
## 💻 Usage Examples

### Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import requests
from io import BytesIO

# Fetch the demo image
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# Load the model in fp16 on GPU; trust_remote_code is required because
# the vision components ship as custom code in the repository
model = AutoModelForCausalLM.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    use_fast=True,
)

# Ask a question about the image using the repo's custom helper
print(
    model.answer_question(
        image, "Briefly describe the image", tokenizer, max_new_tokens=128, do_sample=True, temperature=0.3
    ),
)
```
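Assuming `answer_question` forwards these keyword arguments to the underlying `generate` call (as its signature suggests), `do_sample=True` with `temperature=0.3` produces mildly varied outputs across runs; pass `do_sample=False` for deterministic greedy decoding.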
### Advanced Usage
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import requests
from io import BytesIO

# Fetch the demo image
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# 4-bit quantization config; the projector and vision tower are skipped
# so they stay in fp16
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["mm_projector", "vision_model"],
)

model = AutoModelForCausalLM.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_cfg,
)

tokenizer = AutoTokenizer.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    use_fast=True,
)

print(
    model.answer_question(
        image, "Briefly describe the image", tokenizer, max_new_tokens=128, do_sample=True, temperature=0.3
    ),
)
```
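Here `llm_int8_skip_modules` keeps `mm_projector` and `vision_model` in fp16 while the language model weights are quantized to 4-bit. Since these vision components are small relative to the 8B language model, leaving them unquantized costs little memory and helps preserve visual fidelity.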