llava-llama-3-8b-v1_1-transformers


Developed by xtuner

A LLaVA model fine-tuned based on Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-text-to-text tasks

Image-to-Text

Safetensors

#Multimodal Dialogue #High-Resolution Image Understanding #LoRA Fine-Tuning

Downloads 454.61k

Release Time : 4/26/2024

Model Overview

This is a multimodal model capable of understanding image content and generating relevant textual descriptions or answering questions about images.

Model Features

Multimodal Understanding

Combines a visual encoder with a language model to understand image content and generate relevant text

High Performance

Outperforms LLaVA-v1.5-7B on most reported benchmarks, including MMBench, SEED-IMG, AI2D, and MMStar

LoRA Fine-Tuning

Fine-tunes the visual encoder (ViT) with LoRA while fully fine-tuning the LLM
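
To give a feel for why LoRA suits the visual encoder, the sketch below compares the number of trainable parameters in a full update of one linear layer with a rank-r LoRA update of the same layer. The 1024 x 1024 layer shape and the rank of 16 are illustrative assumptions, not values taken from the XTuner training configuration.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_trainable_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


# Hypothetical layer shape and rank, chosen only for illustration.
d_out, d_in, rank = 1024, 1024, 16
lora = lora_trainable_params(d_out, d_in, rank)
full = full_trainable_params(d_out, d_in)
print(f"LoRA: {lora:,} params, full: {full:,} params, ratio: {lora / full:.2%}")
```

The low-rank factors are the only weights that receive gradients under this scheme, so the encoder can adapt without a full copy of its weights being updated.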

Model Capabilities

Image content understanding

Image question answering

Multimodal dialogue

Visual reasoning

Use Cases

Visual Question Answering

Image Content Description

Provides detailed descriptions of image content

Accurately identifies objects, scenes, and relationships in images

Visual Reasoning

Answers reasoning questions about images

Scores 72.3 on MMBench Test (EN) and 66.4 on MMBench Test (CN)

Education

Science Question Answering

Answers science questions based on images

Achieved 72.9 on ScienceQA test

🚀 llava-llama-3-8b-v1_1-hf

A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 with specific datasets.

🚀 Quick Start

Chat by `pipeline`

from transformers import pipeline
from PIL import Image
import requests

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"
# device=0 places the model on the first GPU
pipe = pipeline("image-to-text", model=model_id, device=0)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Download the sample image and open it with PIL
image = Image.open(requests.get(url, stream=True).raw)
# Llama-3 chat format; <image> marks where the image features are inserted
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat are these?<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")
outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
# Output:
# [{'generated_text': 'user\n\n\nWhat are these?assistant\n\nThese are two cats, one brown and one gray, lying on a pink blanket. sleep. brown and gray cat sleeping on a pink blanket.'}]
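
The pipeline output above echoes the prompt text before the answer. A minimal sketch of separating the two, assuming the echoed text always contains the `assistant\n\n` marker seen in that output; `extract_reply` is a hypothetical helper, not part of `transformers`:

```python
def extract_reply(generated_text: str, marker: str = "assistant\n\n") -> str:
    """Return the text after the last assistant marker, or the input unchanged."""
    _, found, reply = generated_text.rpartition(marker)
    return reply.strip() if found else generated_text.strip()


# Sample copied from the pipeline output shown above.
sample = [{
    "generated_text": (
        "user\n\n\nWhat are these?assistant\n\n"
        "These are two cats, one brown and one gray, lying on a pink blanket. "
        "sleep. brown and gray cat sleeping on a pink blanket."
    )
}]
reply = extract_reply(sample[0]["generated_text"])
print(reply)
```

Splitting on the last occurrence of the marker keeps the helper usable if a longer conversation is echoed back, although that multi-turn case is an assumption and untested here.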

Chat by pure `transformers`

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"

# Llama-3 chat format; <image> marks where the image features are inserted
prompt = ("<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat are these?<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Load the weights in half precision and move the model to the first GPU
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(requests.get(image_file, stream=True).raw)
# Keyword arguments avoid depending on the positional order of text and images
inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

# Greedy decoding (do_sample=False) makes the output deterministic
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
# Output:
# These are two cats, one brown and one gray, lying on a pink blanket. sleep. brown and gray cat sleeping on a pink blanket.
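
Both examples above hard-code the same Llama-3 style prompt string. The sketch below builds that string for an arbitrary single-turn question; `build_prompt` is a hypothetical helper introduced here for illustration, and only the single-turn layout shown above is covered.

```python
def build_prompt(question: str, with_image: bool = True) -> str:
    """Wrap one user question in the Llama-3 chat markers used above."""
    # <image> is the placeholder the processor replaces with image features.
    image_tag = "<image>\n" if with_image else ""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{image_tag}{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_prompt("What are these?")
print(repr(prompt))
```

The returned string can be passed as `prompt` to either example in place of the literal.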

Reproduce

Please refer to docs.

✨ Features

llava-llama-3-8b-v1_1-hf is a LLaVA model fine-tuned from meta-llama/Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by XTuner.

Note: This model is in HuggingFace LLaVA format.

Resources:

GitHub: xtuner
Official LLaVA format model: xtuner/llava-llama-3-8b-v1_1-hf
XTuner LLaVA format model: xtuner/llava-llama-3-8b-v1_1
GGUF format model: xtuner/llava-llama-3-8b-v1_1-gguf

📚 Documentation

Details

Property	Details
Datasets	Lin-Chen/ShareGPT4V
Pipeline Tag	image-text-to-text
Library Name	xtuner

Model	Visual Encoder	Projector	Resolution	Pretraining Strategy	Fine-tuning Strategy	Pretrain Dataset	Fine-tune Dataset
LLaVA-v1.5-7B	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, Frozen ViT	LLaVA-PT (558K)	LLaVA-Mix (665K)
LLaVA-Llama-3-8B	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, LoRA ViT	LLaVA-PT (558K)	LLaVA-Mix (665K)
LLaVA-Llama-3-8B-v1.1	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, LoRA ViT	ShareGPT4V-PT (1246K)	InternVL-SFT (1268K)
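
The resolution of 336 and the patch size of 14 in the encoder name (CLIP-ViT-Large-patch14-336) determine how many image patches the ViT produces. A small sketch of that arithmetic follows; how many of these patch features actually reach the LLM depends on the implementation and is not stated in this card.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT cuts an image into."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


tokens = num_patch_tokens(image_size=336, patch_size=14)
print(tokens)  # 576 (a 24 x 24 grid of patches)
```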

Results

Model	MMBench Test (EN)	MMBench Test (CN)	CCBench Dev	MMMU Val	SEED-IMG	AI2D Test	ScienceQA Test	HallusionBench aAcc	POPE	GQA	TextVQA	MME	MMStar
LLaVA-v1.5-7B	66.5	59.0	27.5	35.3	60.5	54.8	70.4	44.9	85.9	62.0	58.2	1511/348	30.3
LLaVA-Llama-3-8B	68.9	61.6	30.4	36.8	69.8	60.9	73.3	47.3	87.2	63.5	58.0	1506/295	38.2
LLaVA-Llama-3-8B-v1.1	72.3	66.4	31.6	36.8	70.1	70.0	72.9	47.7	86.4	62.6	59.0	1469/349	45.1
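
To make the comparison with LLaVA-v1.5-7B concrete, the sketch below computes per-benchmark differences from the rows above. MME is left out because it is reported as a two-part score (1469/349 versus 1511/348, so its first part is lower for v1.1); the variable names are illustrative.

```python
benchmarks = [
    "MMBench Test (EN)", "MMBench Test (CN)", "CCBench Dev", "MMMU Val",
    "SEED-IMG", "AI2D Test", "ScienceQA Test", "HallusionBench aAcc",
    "POPE", "GQA", "TextVQA", "MMStar",
]
# Scores copied from the results table, in the same order as `benchmarks`.
llava_v15_7b = [66.5, 59.0, 27.5, 35.3, 60.5, 54.8, 70.4, 44.9, 85.9, 62.0, 58.2, 30.3]
llava_llama3_v11 = [72.3, 66.4, 31.6, 36.8, 70.1, 70.0, 72.9, 47.7, 86.4, 62.6, 59.0, 45.1]

# Difference (v1.1 minus v1.5-7B), rounded to the table's one-decimal precision.
deltas = {
    name: round(new - old, 1)
    for name, old, new in zip(benchmarks, llava_v15_7b, llava_llama3_v11)
}
wins = [name for name, delta in deltas.items() if delta > 0]
largest = max(deltas, key=deltas.get)

print(f"{len(wins)} of {len(deltas)} benchmarks improved")
print(f"Largest gain: {largest} (+{deltas[largest]})")
```

On these twelve single-number benchmarks v1.1 is ahead on every one, with the largest gains on AI2D Test (+15.2) and MMStar (+14.8).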

📄 Citation

@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished = {\url{https://github.com/InternLM/xtuner}},
    year={2023}
}
