Pixtral-12B-Captioner-Relaxed
Pixtral-12B-Captioner-Relaxed is an instruction-tuned version of Pixtral-12B-2409, an advanced multimodal large language model. This fine-tuned model generates significantly more detailed image descriptions and was trained on a hand-curated dataset for text-to-image models.
Quick Start
The following is a quick start guide to using Pixtral-12B-Captioner-Relaxed:
```python
from PIL import Image
from transformers import LlavaForConditionalGeneration, AutoProcessor
from transformers import BitsAndBytesConfig
import torch
import matplotlib.pyplot as plt  # optional; only needed if you want to display images
# optional 4-bit quantization config; see the Installation section below
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4"
)
model_id = "Ertugrul/Pixtral-12B-Captioner-Relaxed"
# pass quantization_config=quantization_config here to load the model in 4-bit instead of bfloat16
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image.\n"},
{
"type": "image",
}
],
}
]
PROMPT = processor.apply_chat_template(conversation, add_generation_prompt=True)
image = Image.open(r"PATH_TO_YOUR_IMAGE")
def resize_image(image, target_size=768):
"""Resize the image to have the target size on the shortest side."""
width, height = image.size
if width < height:
new_width = target_size
new_height = int(height * (new_width / width))
else:
new_height = target_size
new_width = int(width * (new_height / height))
return image.resize((new_width, new_height), Image.LANCZOS)
image = resize_image(image, 768)
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
with torch.no_grad():
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
generate_ids = model.generate(**inputs, max_new_tokens=384, do_sample=True, temperature=0.3, use_cache=True, top_k=20)
output_text = processor.batch_decode(generate_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]
print(output_text)
```
⨠Features
- Enhanced Detail: Generates more comprehensive and nuanced image descriptions.
- Relaxed Constraints: Offers less restrictive image descriptions compared to the base model.
- Natural Language Output: Describes different subjects in the image while specifying their locations using natural language.
- Optimized for Image Generation: Produces captions in formats compatible with state-of-the-art text-to-image generation models.
Important Note
This fine-tuned model is optimized for creating text-to-image datasets, so performance on other complex tasks may be lower than that of the original model.
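To illustrate this dataset use case, here is a minimal sketch that reuses the `model`, `processor`, `PROMPT`, and `resize_image` objects from the quick start to caption every image in a folder and write a `.txt` sidecar caption next to each file. The folder path, file extension, and `caption_image` helper are hypothetical illustrations, not part of the model's API.

```python
from pathlib import Path
from PIL import Image
import torch

def caption_image(image, model, processor, prompt):
    """Generate one caption for a PIL image, using the same settings as the quick start."""
    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
    with torch.no_grad():
        generate_ids = model.generate(**inputs, max_new_tokens=384, do_sample=True,
                                      temperature=0.3, top_k=20, use_cache=True)
    return processor.batch_decode(
        generate_ids[:, inputs.input_ids.shape[1]:],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )[0].strip()

# hypothetical dataset folder; adjust the path and extension to your data
dataset_dir = Path("PATH_TO_YOUR_DATASET")
for image_path in sorted(dataset_dir.glob("*.jpg")):
    image = resize_image(Image.open(image_path).convert("RGB"), 768)
    caption = caption_image(image, model, processor, PROMPT)
    # a .txt file next to each image is a common text-to-image dataset layout
    image_path.with_suffix(".txt").write_text(caption, encoding="utf-8")
```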
Installation
The 12B model requires about 24 GB of VRAM at half precision (bfloat16). It can also be loaded with 8-bit or 4-bit quantization, but expect some degradation in caption quality.
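As an example, here is a minimal sketch of loading the model in 4-bit, using the same BitsAndBytesConfig as in the quick start (assumes the bitsandbytes package is installed):

```python
import torch
from transformers import LlavaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model_id = "Ertugrul/Pixtral-12B-Captioner-Relaxed"
# passing quantization_config loads the weights in 4-bit; expect slightly lower caption quality
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
processor = AutoProcessor.from_pretrained(model_id)
```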
Documentation
For more detailed options, refer to the Pixtral-12B-2409 or mistral-community/pixtral-12b documentation.
You can also try Qwen2-VL-7B-Captioner-Relaxed as a smaller alternative; it was trained in a similar manner.
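A minimal loading sketch for that alternative, assuming the repository id Ertugrul/Qwen2-VL-7B-Captioner-Relaxed and the standard Qwen2-VL interface in transformers (prompt and generation settings are illustrative):

```python
import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# assumed repository id for the smaller alternative model
alt_model_id = "Ertugrul/Qwen2-VL-7B-Captioner-Relaxed"
alt_model = Qwen2VLForConditionalGeneration.from_pretrained(
    alt_model_id, device_map="auto", torch_dtype=torch.bfloat16
)
alt_processor = AutoProcessor.from_pretrained(alt_model_id)

conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image.\n"}]}
]
prompt = alt_processor.apply_chat_template(conversation, add_generation_prompt=True)
image = Image.open(r"PATH_TO_YOUR_IMAGE")
inputs = alt_processor(text=prompt, images=image, return_tensors="pt").to("cuda")
with torch.no_grad():
    generate_ids = alt_model.generate(**inputs, max_new_tokens=384)
# decode only the newly generated tokens
print(alt_processor.batch_decode(
    generate_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
```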
License
The license for this project is apache-2.0.
| Property | Details |
|---|---|
| Library Name | transformers |
| License | apache-2.0 |
| Base Model | mistralai/Pixtral-12B-2409 |
| Pipeline Tag | image-to-text |