Llama-3.2-11B-Vision-Instruct-nf4 Open-Source Model - Supports Free Image Understanding and Text Generation

Llama 3.2 11B Vision Instruct NF4

Developed by SeanScripts

A 4-bit quantized version of meta-llama/Llama-3.2-11B-Vision-Instruct that supports image understanding and text generation tasks.

Image-to-Text

Transformers

#4-bit quantized vision model #Image caption generation #Efficient inference


Release Time: 9/25/2024

Model Overview

This is a multimodal model that understands image content and generates relevant text descriptions. NF4 quantization reduces the model size, which makes it suitable for deployment in resource-constrained environments.

Model Features

4-bit Quantization Technology

Uses NF4 quantization technology to compress the model to 4-bit precision, significantly reducing memory usage

Multimodal Understanding

Capable of processing both image and text inputs, understanding image content, and generating relevant descriptions

Efficient Inference

The quantized model needs less memory at inference time while maintaining good performance; any speed gain depends on the hardware.
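
As a rough illustration of the memory reduction that 4-bit quantization provides, the sketch below estimates the storage needed for the weights alone. The parameter count of about 11 billion is an assumption taken from the "11B" in the model name, and the function name is hypothetical. The real footprint will be somewhat larger, because quantization constants, any layers kept in higher precision, activations, and the KV cache are not counted.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~11 billion parameters, taken from the "11B" in the model name.
NUM_PARAMS = 11e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights of the base model
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit NF4 weights

print(f"16-bit weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of 16-bit weights.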

Model Capabilities

Image content understanding

Image caption generation

Multimodal dialogue

Visual question answering

Use Cases

Content Generation

Automatic image captioning

Generates descriptive text for images, useful for content management systems

Produces accurate and fluent image descriptions

Assistive Tools

Assistance for visually impaired users

Converts image content into text descriptions that can be read aloud by a text-to-speech tool

Helps visually impaired individuals understand visual content

🚀 Llama-3.2-11B-Vision-Instruct-nf4

A converted model from meta-llama/Llama-3.2-11B-Vision-Instruct using BitsAndBytes with NF4 (4-bit) quantization for image-text-to-text tasks.

🚀 Quick Start

This model is converted from meta-llama/Llama-3.2-11B-Vision-Instruct using BitsAndBytes with NF4 (4-bit) quantization, without double quantization. Loading it requires the bitsandbytes library.
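
The author's conversion script is not given here, so the following is only a sketch of how the settings described above (4-bit, NF4, no double quantization) could be expressed for `transformers.BitsAndBytesConfig`. The helper names are hypothetical. The import is done lazily so that the settings can be inspected even where transformers and bitsandbytes are not installed.

```python
def nf4_quantization_kwargs() -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig describing plain NF4."""
    return {
        "load_in_4bit": True,                # store weights in 4 bits
        "bnb_4bit_quant_type": "nf4",        # NormalFloat4 rather than "fp4"
        "bnb_4bit_use_double_quant": False,  # this model does not use double quantization
    }


def build_quantization_config():
    """Construct the config object; needs transformers and bitsandbytes installed."""
    from transformers import BitsAndBytesConfig  # imported lazily on purpose

    return BitsAndBytesConfig(**nf4_quantization_kwargs())


print(nf4_quantization_kwargs())
```

A config like this would typically be passed as `quantization_config=` to `from_pretrained` when quantizing the base model on the fly. The checkpoint published here is already quantized, so loading it should not need an explicit config.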

✨ Features

Converted with NF4 (4-bit) quantization using BitsAndBytes.
Suitable for image-text-to-text tasks.

📦 Installation

To use this model, you need to have bitsandbytes installed.
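
Before loading the model, it can help to confirm that the required packages are importable. The check below is a minimal sketch; the helper name `is_installed` is hypothetical, and the package list is an assumption based on the imports in the usage example (`PIL` is the import name of the Pillow package).

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found without importing it."""
    return importlib.util.find_spec(package) is not None


# Assumed requirements, taken from the imports in the usage example.
REQUIRED = ("bitsandbytes", "transformers", "PIL")

missing = [name for name in REQUIRED if not is_installed(name)]
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```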

💻 Usage Examples

Basic Usage

from transformers import MllamaForConditionalGeneration, AutoProcessor
from PIL import Image
import time

# Load model
model_id = "SeanScripts/Llama-3.2-11B-Vision-Instruct-nf4"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    use_safetensors=True,
    device_map="cuda:0"
)
# Load processor (handles image preprocessing and tokenization)
processor = AutoProcessor.from_pretrained(model_id)

# Caption a local image (could use a more specific prompt)
IMAGE = Image.open("test.png").convert("RGB")
PROMPT = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Caption this image:
<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

inputs = processor(IMAGE, PROMPT, return_tensors="pt").to(model.device)
prompt_tokens = len(inputs['input_ids'][0])
print(f"Prompt tokens: {prompt_tokens}")

t0 = time.time()
generate_ids = model.generate(**inputs, max_new_tokens=256)
t1 = time.time()
total_time = t1 - t0
generated_tokens = len(generate_ids[0]) - prompt_tokens
tokens_per_second = generated_tokens / total_time
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")

output = processor.decode(generate_ids[0][prompt_tokens:]).replace('<|eot_id|>', '')
print(output)

You can get a set of ComfyUI custom nodes for running this model here: https://github.com/SeanScripts/ComfyUI-PixtralLlamaVision
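
The usage example hard-codes a caption instruction inside the prompt. For other tasks, such as visual question answering, the same layout can be filled with a different instruction. The helper below is a sketch that reproduces the prompt template from the usage example exactly; the name `build_prompt` is hypothetical, and it assumes a single image and an instruction that contains no special tokens of its own.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the single-image prompt layout from the usage example."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        f"{instruction}\n"
        "<|image|><|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    )


caption_prompt = build_prompt("Caption this image:")
question_prompt = build_prompt("How many people are in this picture?")
print(question_prompt)
```

The resulting string can be passed to the processor in place of `PROMPT` in the usage example.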

📄 License

The license for this model is llama3.2.

Property	Details
Model Type	Converted from meta-llama/Llama-3.2-11B-Vision-Instruct using NF4 (4-bit) quantization
Base Model	meta-llama/Llama-3.2-11B-Vision-Instruct
Pipeline Tag	image-text-to-text
Library Name	transformers
