Pixtral-12b-nf4 Open-source AI Model - Free Deployment for Image-to-Text and Chinese Description Generation

Pixtral 12b Nf4

Developed by SeanScripts

A 4-bit (NF4) quantized version of the Mistral community's Pixtral-12B for image-text-to-text tasks, with support for Chinese description generation.

Image-to-Text

Transformers

Open Source License: Apache-2.0 #Image description generation #4-bit quantization #Low VRAM usage

Downloads 236

Release Time: 9/25/2024

Model Overview

This is a vision-language model quantized with NF4 that generates text descriptions from input images. It is implemented on the Llava architecture and is suitable for multimodal understanding tasks.

Model Features

4-bit quantization

Uses BitsAndBytes NF4 quantization, which reduces VRAM requirements to approximately 10 GB.

Multimodal understanding

Processes image and text inputs together for vision-language interaction.

Efficient inference

Generates about 10-12 tokens per second on an RTX 4090 (without flash attention).

Model Capabilities

Image description generation

Multimodal content understanding

Chinese text generation

Use Cases

Content creation

Automatic image annotation

Generate descriptive text for images.

Generate high-quality natural language descriptions.

Assistive tools

Visual impairment assistance

Convert visual content into text descriptions.

🚀 Pixtral-12B-NF4

This project is a conversion of mistral-community/pixtral-12b with 4-bit quantization, enabling efficient image-text-to-text processing.

🚀 Quick Start

This model is converted from mistral-community/pixtral-12b using BitsAndBytes with NF4 (4-bit) quantization, without double quantization. You need to install bitsandbytes to load the model.
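For reference, the quantization described above could correspond to BitsAndBytes settings like the following. This is a sketch under stated assumptions, not the author's actual conversion script: `NF4_SETTINGS` and `build_quantization_config` are hypothetical names, and the compute dtype is not stated in the source, so it is left at the library default.

```python
# Assumed settings matching "NF4 (4-bit) quantization, without double quantization".
NF4_SETTINGS = {
    "load_in_4bit": True,                # store weights in 4-bit precision
    "bnb_4bit_quant_type": "nf4",        # NormalFloat4 data type
    "bnb_4bit_use_double_quant": False,  # scaling constants are not quantized again
}


def build_quantization_config():
    """Build a transformers BitsAndBytesConfig from NF4_SETTINGS.

    transformers is imported lazily so the settings can be inspected
    without the library installed.
    """
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**NF4_SETTINGS)
```

Passing the result as `quantization_config=` to `from_pretrained` on the unquantized mistral-community/pixtral-12b checkpoint would quantize it at load time. The pre-quantized model on this page does not need this step.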

💻 Usage Examples

Basic Usage

Here is an example of using the model for image captioning:

from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import time

# Load the pre-quantized model (bitsandbytes must be installed)
model_id = "SeanScripts/pixtral-12b-nf4"
device = "cuda:0"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    use_safetensors=True,
    device_map=device
)
# Load the processor (handles both tokenization and image preprocessing)
processor = AutoProcessor.from_pretrained(model_id)

# Caption a local image
images = [Image.open("test.png").convert("RGB")]
PROMPT = "<s>[INST]Caption this image:\n[IMG][/INST]"

inputs = processor(images=images, text=PROMPT, return_tensors="pt").to(device)
prompt_tokens = len(inputs['input_ids'][0])
print(f"Prompt tokens: {prompt_tokens}")

# Time the generation to measure throughput
t0 = time.time()
generate_ids = model.generate(**inputs, max_new_tokens=512)
t1 = time.time()
total_time = t1 - t0
generated_tokens = len(generate_ids[0]) - prompt_tokens
tokens_per_second = generated_tokens / total_time
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")

# Decode the full sequence (prompt plus generated caption)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
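The prompt string in the example above follows a fixed pattern: an instruction wrapped in `[INST]...[/INST]` with an `[IMG]` placeholder for the image. A small helper can make that pattern reusable. This is a minimal sketch; `build_prompt` is a hypothetical name, and repeating `[IMG]` once per image for multi-image prompts is an assumption that is not confirmed by this page.

```python
def build_prompt(instruction: str, num_images: int = 1) -> str:
    """Return a Pixtral-style prompt with one [IMG] placeholder per image.

    Assumption: multi-image prompts simply repeat the [IMG] placeholder.
    """
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    return f"<s>[INST]{instruction}\n{'[IMG]' * num_images}[/INST]"


# Reproduces the prompt used in the captioning example.
print(build_prompt("Caption this image:"))
```

The number of placeholders should match the number of images passed to the processor.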

Performance

On an RTX 4090, this model generates about 10-12 tok/s (without flash attention) and uses approximately 10 GB of VRAM. The generated captions appear to be of good quality, although testing has been limited.
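As a rough sanity check on the VRAM figure, the storage needed for the quantized weights alone can be estimated with simple arithmetic. The sketch below assumes roughly 12 billion parameters, all of them quantized, with one 32-bit scaling constant per block of 64 weights. These are illustrative assumptions, not measurements from the author, and `nf4_weight_gb` is a hypothetical helper.

```python
def nf4_weight_gb(num_params: float, block_size: int = 64, absmax_bits: int = 32) -> float:
    """Estimate weight storage in GB (10**9 bytes) for block-wise 4-bit quantization.

    Each parameter takes 4 bits, plus one scaling constant of `absmax_bits`
    shared by every `block_size` parameters (no double quantization).
    """
    bits_per_param = 4 + absmax_bits / block_size
    return num_params * bits_per_param / 8 / 1e9


# Assumption: about 12 billion parameters, all stored in 4-bit form.
weights_gb = nf4_weight_gb(12e9)
print(f"Estimated weight storage: {weights_gb:.2f} GB")
```

The remainder of the reported VRAM usage would come from any layers left unquantized, activations, the KV cache, and CUDA overhead. That split was not measured here.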

ComfyUI Custom Nodes

You can obtain a set of ComfyUI custom nodes for running this model at the following link: https://github.com/SeanScripts/ComfyUI-PixtralLlamaVision

📄 License

This project is licensed under the Apache-2.0 license.

📦 Information Table

Property	Details
Model Type	Converted from mistral-community/pixtral-12b using BitsAndBytes with NF4 (4-bit) quantization
Training Data	Not specified
Library Name	transformers
Pipeline Tag	image-text-to-text
License	Apache-2.0
