
llava-llama-3-8b-v1_1-GGUF

Developed by MoMonir
A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-to-text tasks.
Release date: 5/4/2024

Model Overview

This is a vision-language model capable of understanding image content and generating relevant textual descriptions, suitable for multimodal interaction scenarios.

Model Features

Multimodal Understanding
Combines a visual encoder with a language model to understand image content and generate relevant text
Efficient Fine-tuning
Uses LoRA to fine-tune the visual encoder, improving model performance
GGUF Format Support
Converted to the GGUF format, making it compatible with llama.cpp-based inference tools and platforms
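A GGUF file can be recognized and inspected by its fixed header: a 4-byte `GGUF` magic, a little-endian `uint32` version, then `uint64` tensor and metadata key-value counts (layout as in GGUF v2/v3). A minimal sketch, assuming that header layout; `read_gguf_header` is an illustrative helper, not part of any library:

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_header(path):
    """Return (version, tensor_count, kv_count) from a GGUF file header.

    Assumes the GGUF v2/v3 little-endian header layout:
    magic (4 bytes), version (uint32), tensor_count (uint64),
    metadata_kv_count (uint64).
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"{path}: not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return version, tensor_count, kv_count
```

This kind of check is useful before handing a downloaded file to an inference tool, since a truncated or mislabeled download fails fast with a clear error instead of a cryptic loader crash.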

Model Capabilities

Image Content Understanding
Image Caption Generation
Multimodal Dialogue
Visual Question Answering
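For local inference, capabilities like captioning and visual question answering are typically exercised through llama.cpp's LLaVA example, which takes the language-model GGUF plus a separate CLIP projector ("mmproj") GGUF. A sketch only: the binary name varies by llama.cpp version (`llava-cli` in older builds, `llama-llava-cli` or `llama-mtmd-cli` in newer ones), and the filenames below are assumptions, not files shipped with this card:

```shell
# Assumed filenames; both GGUF files must be downloaded beforehand.
./llama-llava-cli \
  -m llava-llama-3-8b-v1_1.Q4_K_M.gguf \
  --mmproj mmproj-model-f16.gguf \
  --image photo.jpg \
  -p "Describe this image."
```

The `--mmproj` file carries the vision-encoder projection weights; without it the model can only run as a text-only LLM.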

Use Cases

Content Generation
Automatic Image Tagging
Generates descriptive text for images
Can be used to assist visually impaired individuals or content management systems
Education
Visual Question Answering System
Answers questions about image content
Achieved a score of 72.3 (EN) in MMBench testing