Llama-3.1-8B-vision-378 Open-Source Model: Adding Visual Capabilities to Llama 3 for Easy Image Task Processing

Llama 3.1 8B Vision 378

Developed by qresearch

This project trains a projection module that adds vision capabilities to Llama 3 using SigLIP. The trained module is then applied to the Llama-3.1-8B-Instruct model.

Image-to-Text

Transformers

#Multimodal Visual Question Answering #SigLIP Projection Technology #4-bit Quantization Support

Downloads 203

Release Time: 7/23/2024

Model Overview

This is a multimodal model that combines vision and language capabilities: it takes image and text inputs and generates text outputs.

Model Features

Enhanced Visual Capabilities

Adds visual processing capabilities to the Llama 3 model through a trained projection module

SigLIP Technology Application

Uses SigLIP to process images jointly with text

4-bit Quantization Support

Supports 4-bit quantized deployment, which reduces memory and hardware requirements
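To make the reduced hardware requirement concrete, here is a rough back-of-the-envelope sketch of weight memory at different precisions. It assumes a round figure of 8 billion language-model parameters (taken from the "8B" in the model name, not from a measured count) and ignores the vision modules, activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: roughly 8 billion parameters, based only on the "8B" in the name.
LLM_PARAMS = 8.0e9

fp16_gib = weight_memory_gib(LLM_PARAMS, 16)  # float16, as in the basic example
int4_gib = weight_memory_gib(LLM_PARAMS, 4)   # 4-bit, as in the quantized example

print(f"float16 weights: ~{fp16_gib:.1f} GiB")  # ~14.9 GiB
print(f"4-bit weights:   ~{int4_gib:.1f} GiB")  # ~3.7 GiB
```

Under these assumptions the language-model weights shrink by a factor of four, which is what lets the quantized model fit on GPUs with far less memory.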

Model Capabilities

Image Understanding

Image Caption Generation

Visual Question Answering

Multimodal Reasoning

Use Cases

Image Understanding

Image Caption Generation

Given an image, the model generates a concise textual description of its content

Visual Question Answering

Answers questions about the content of an image

🚀 llama-3.1-8B-vision-378

A projection module trained to add vision capabilities to Llama 3 using SigLIP, then applied to Llama-3.1-8B-Instruct. Built by @yeswondwerr and @qtnx_.

🚀 Quick Start

This project is a projection module that adds vision capabilities to Llama 3. It uses SigLIP, and the trained module is applied to Llama-3.1-8B-Instruct.
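The idea of a projection module can be illustrated with a small NumPy sketch. This is not the project's actual code: the two-layer MLP design, the `Projector` class, and all dimensions below are hypothetical, chosen in the style of LLaVA-like models. The sketch only shows the general data flow, in which patch features from a vision encoder are mapped into the language model's embedding space and placed alongside the text token embeddings.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class Projector:
    """Hypothetical two-layer MLP: vision feature space -> LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, rng: np.random.Generator):
        self.w1 = rng.normal(0.0, 0.02, (vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0.0, 0.02, (llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, patch_features: np.ndarray) -> np.ndarray:
        hidden = gelu(patch_features @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


rng = np.random.default_rng(0)

# Toy sizes for illustration only; the real encoder and LLM are much wider.
vision_dim, llm_dim = 32, 64
num_patches, num_text_tokens = 9, 5

# Stand-ins for the vision encoder output and the LLM's text token embeddings.
patch_features = rng.normal(size=(num_patches, vision_dim))
text_embeddings = rng.normal(size=(num_text_tokens, llm_dim))

projector = Projector(vision_dim, llm_dim, rng)
image_tokens = projector(patch_features)  # shape: (num_patches, llm_dim)

# The language model then sees the image tokens followed by the text tokens.
llm_input = np.concatenate([image_tokens, text_embeddings], axis=0)
print(llm_input.shape)  # (14, 64)
```

In a setup like this, the projector is the component that gets trained, which matches the description above of a trained projection module being applied to an existing instruction-tuned model.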

📄 License

The license for this project is llama3.1.

📦 Dataset

The dataset used in this project is liuhaotian/LLaVA-CC3M-Pretrain-595K.

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import requests
from io import BytesIO

# Download the demo image and decode it with PIL
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
image = Image.open(BytesIO(response.content))

# Load the model in float16 on the GPU. trust_remote_code is required because
# answer_question comes from custom code shipped in the model repository.
model = AutoModelForCausalLM.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    use_fast=True,
)

# Ask a question about the image, sampling at a low temperature
print(
    model.answer_question(
        image, "Briefly describe the image", tokenizer, max_new_tokens=128, do_sample=True, temperature=0.3
    ),
)

Advanced Usage

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import requests
from io import BytesIO

# Download the demo image and decode it with PIL
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
image = Image.open(BytesIO(response.content))

# 4-bit quantization config (requires the bitsandbytes package).
# The projection and vision modules are skipped so they stay unquantized.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["mm_projector", "vision_model"],
)

# No .to("cuda") here: the quantized model is placed on the device during loading
model = AutoModelForCausalLM.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_cfg,
)

tokenizer = AutoTokenizer.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    use_fast=True,
)

# Ask a question about the image, sampling at a low temperature
print(
    model.answer_question(
        image, "Briefly describe the image", tokenizer, max_new_tokens=128, do_sample=True, temperature=0.3
    ),
)

ASCII Art

                                       .x+=:.                                                             
                                      z`    ^%                                                  .uef^"    
               .u    .                   .   <k                           .u    .             :d88E       
    .u@u     .d88B :@8c       .u       .@8Ned8"      .u          u      .d88B :@8c        .   `888E       
 .zWF8888bx ="8888f8888r   ud8888.   .@^%8888"    ud8888.     us888u.  ="8888f8888r  .udR88N   888E .z8k  
.888  9888    4888>'88"  :888'8888. x88:  `)8b. :888'8888. .@88 "8888"   4888>'88"  <888'888k  888E~?888L 
I888  9888    4888> '    d888 '88%" 8888N=*8888 d888 '88%" 9888  9888    4888> '    9888 'Y"   888E  888E 
I888  9888    4888>      8888.+"     %8"    R88 8888.+"    9888  9888    4888>      9888       888E  888E 
I888  9888   .d888L .+   8888L        @8Wou 9%  8888L      9888  9888   .d888L .+   9888       888E  888E 
`888Nx?888   ^"8888*"    '8888c. .+ .888888P`   '8888c. .+ 9888  9888   ^"8888*"    ?8888u../  888E  888E 
 "88" '888      "Y"       "88888%   `   ^"F      "88888%   "888*""888"     "Y"       "8888P'  m888N= 888> 
       88E                  "YP'                   "YP'     ^Y"   ^Y'                  "P'     `Y"   888  
       98>                                                                                          J88"  
       '8                                                                                           @%    
        `                                                                                         :"
