llava-llama-3-8b-v1_1-gguf

Developed by xtuner

A multimodal model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image understanding and text generation

Tags: Image-to-Text, Multimodal Dialogue, High-Resolution Image Understanding, Llama-3 Fine-Tuning

Downloads 9,484

Release Time: 4/26/2024

Model Overview

This is a vision-language model capable of understanding image content and generating relevant textual descriptions, suitable for image-to-text tasks

Model Features

Powerful Visual Understanding

Uses the CLIP-ViT-Large-patch14-336 visual encoder to interpret image content

Llama-3 Language Model

Built on Meta-Llama-3-8B-Instruct for text generation

Input Resolution

Processes image input at a resolution of 336 pixels, matching the CLIP-ViT-Large-patch14-336 encoder

Efficient Fine-Tuning

Fine-tuned with the XTuner toolkit

Model Capabilities

Image content understanding

Image caption generation

Multimodal Q&A

Visual reasoning

Use Cases

Image Understanding

Image Caption Generation

Generates detailed textual descriptions for input images

Produces natural and fluent image description texts

Visual Question Answering

Answers various questions about image content

Accurately responds to image-related questions

Education

Scientific Diagram Interpretation

Explains scientific charts and schematic diagrams

Helps students understand complex scientific concepts

🚀 llava-llama-3-8b-v1_1

llava-llama-3-8b-v1_1 is a fine-tuned LLaVA model for image-to-text tasks, offering high-performance visual understanding capabilities.

🚀 Quick Start

Download models

# mmproj
wget https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/resolve/main/llava-llama-3-8b-v1_1-mmproj-f16.gguf

# fp16 llm
wget https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/resolve/main/llava-llama-3-8b-v1_1-f16.gguf

# int4 llm
wget https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/resolve/main/llava-llama-3-8b-v1_1-int4.gguf

# (optional) ollama fp16 modelfile
wget https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/resolve/main/OLLAMA_MODELFILE_F16

# (optional) ollama int4 modelfile
wget https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/resolve/main/OLLAMA_MODELFILE_INT4
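The files above are large, so re-running every `wget` after an interrupted session can be wasteful. The following is a minimal sketch, assuming `wget` is installed and that all files live under the same base URL as listed above. The names `BASE_URL`, `gguf_url`, and `fetch_if_missing` are hypothetical helpers introduced for illustration, not part of any tool mentioned here.

```shell
# Assumption: every file is served from the same base URL as the wget lines above.
BASE_URL="https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf/resolve/main"

# Hypothetical helper: print the download URL for a given file name.
gguf_url() {
  printf '%s/%s\n' "$BASE_URL" "$1"
}

# Hypothetical helper: skip files that already exist and are non-empty,
# otherwise download (or resume a partial download) with wget -c.
fetch_if_missing() {
  local file="$1"
  if [ -s "$file" ]; then
    echo "skip: $file already present"
  else
    wget -c "$(gguf_url "$file")"
  fi
}

# Example (commented out so that sourcing this snippet triggers no download):
# fetch_if_missing llava-llama-3-8b-v1_1-mmproj-f16.gguf
# fetch_if_missing llava-llama-3-8b-v1_1-int4.gguf
```

Note that the size check only detects files that exist and are non-empty. A partially downloaded file will be skipped, so delete it or call `wget -c` on it directly to resume.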

Chat with `ollama`

# fp16
ollama create llava-llama3-f16 -f ./OLLAMA_MODELFILE_F16
ollama run llava-llama3-f16 "xx.png Describe this image"

# int4
ollama create llava-llama3-int4 -f ./OLLAMA_MODELFILE_INT4
ollama run llava-llama3-int4 "xx.png Describe this image"

Chat with `llama.cpp`

Build llama.cpp, then build ./llava-cli, following the llama.cpp documentation.

Note: llava-llama-3-8b-v1_1 uses the Llama-3-instruct chat template.

# fp16
./llava-cli -m ./llava-llama-3-8b-v1_1-f16.gguf --mmproj ./llava-llama-3-8b-v1_1-mmproj-f16.gguf --image YOUR_IMAGE.jpg -c 4096 -e -p "<|start_header_id|>user<|end_header_id|>\n\n<image>\nDescribe this image<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

# int4
./llava-cli -m ./llava-llama-3-8b-v1_1-int4.gguf --mmproj ./llava-llama-3-8b-v1_1-mmproj-f16.gguf --image YOUR_IMAGE.jpg -c 4096 -e -p "<|start_header_id|>user<|end_header_id|>\n\n<image>\nDescribe this image<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
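The prompt passed to `-p` in the commands above follows the Llama-3-instruct chat template, with the `<image>` placeholder inside the user turn. Typing that string by hand for each question is error-prone, so it can be wrapped in a small function. This is a minimal sketch: `build_llava_prompt` is a hypothetical helper name, and the template pieces are copied from the commands above.

```shell
# Hypothetical helper (not part of llama.cpp or XTuner): wrap a user question
# in the Llama-3-instruct template used by the llava-cli commands above.
build_llava_prompt() {
  local question="$1"
  # The template pieces are single-quoted so that \n stays a literal
  # backslash-n; the -e flag of llava-cli expands it into a real newline.
  printf '%s%s%s' \
    '<|start_header_id|>user<|end_header_id|>\n\n<image>\n' \
    "$question" \
    '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
}

# Example (commented out; requires the built binary and downloaded files):
# ./llava-cli -m ./llava-llama-3-8b-v1_1-int4.gguf \
#   --mmproj ./llava-llama-3-8b-v1_1-mmproj-f16.gguf \
#   --image YOUR_IMAGE.jpg -c 4096 -e \
#   -p "$(build_llava_prompt 'What objects are on the table?')"
```

The function deliberately leaves escape expansion to `-e`, so its output matches the hand-written prompt strings above character for character.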

Reproduce

Please refer to the XTuner documentation.

✨ Features

Fine-Tuned Model: llava-llama-3-8b-v1_1 is fine-tuned from meta-llama/Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by XTuner.
Multiple Formats: Available in GGUF format, with corresponding models in HuggingFace LLaVA format and Official LLaVA format.

📚 Documentation

Model

llava-llama-3-8b-v1_1 is a LLaVA model fine-tuned from meta-llama/Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by XTuner.

Note: This model is in GGUF format.

Resources:

GitHub: xtuner
HuggingFace LLaVA format model: xtuner/llava-llama-3-8b-v1_1-transformers
Official LLaVA format model: xtuner/llava-llama-3-8b-v1_1-hf
XTuner LLaVA format model: xtuner/llava-llama-3-8b-v1_1

Details

Property	Details
Datasets	Lin-Chen/ShareGPT4V
Pipeline Tag	image-to-text

Model	Visual Encoder	Projector	Resolution	Pretraining Strategy	Fine-tuning Strategy	Pretrain Dataset	Fine-tune Dataset
LLaVA-v1.5-7B	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, Frozen ViT	LLaVA-PT (558K)	LLaVA-Mix (665K)
LLaVA-Llama-3-8B	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, LoRA ViT	LLaVA-PT (558K)	LLaVA-Mix (665K)
LLaVA-Llama-3-8B-v1.1	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, LoRA ViT	ShareGPT4V-PT (1246K)	InternVL-SFT (1268K)

Results

Model	MMBench Test (EN)	MMBench Test (CN)	CCBench Dev	MMMU Val	SEED-IMG	AI2D Test	ScienceQA Test	HallusionBench aAcc	POPE	GQA	TextVQA	MME	MMStar
LLaVA-v1.5-7B	66.5	59.0	27.5	35.3	60.5	54.8	70.4	44.9	85.9	62.0	58.2	1511/348	30.3
LLaVA-Llama-3-8B	68.9	61.6	30.4	36.8	69.8	60.9	73.3	47.3	87.2	63.5	58.0	1506/295	38.2
LLaVA-Llama-3-8B-v1.1	72.3	66.4	31.6	36.8	70.1	70.0	72.9	47.7	86.4	62.6	59.0	1469/349	45.1
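
To read the table as changes relative to the LLaVA-v1.5-7B baseline, the per-benchmark differences can be computed directly. This is a minimal sketch, assuming `awk` is available. The `delta` helper is a hypothetical name, and the scores are copied from the table above.

```shell
# Hypothetical helper: print the signed difference (current - baseline)
# rounded to one decimal place.
delta() {
  awk -v old="$1" -v cur="$2" 'BEGIN { printf "%+.1f\n", cur - old }'
}

# LLaVA-v1.5-7B (baseline) vs LLaVA-Llama-3-8B-v1.1, scores from the table
delta 66.5 72.3   # MMBench Test (EN)
delta 54.8 70.0   # AI2D Test
delta 30.3 45.1   # MMStar
delta 1511 1469   # MME, first number of each pair
```

A negative result, as in the last line, marks a benchmark where v1.1 scores below the baseline.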

📄 Citation

@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished = {\url{https://github.com/InternLM/xtuner}},
    year={2023}
}
