Qwen2.5-VL-7B-Captioner-Relaxed
Qwen2.5-VL-7B-Captioner-Relaxed is an instruction-tuned multimodal large language model that produces detailed image descriptions.
🚀 Quick Start
To quickly get started with Qwen2.5-VL-7B-Captioner-Relaxed, follow the steps below.
Prerequisites
If you encounter errors such as KeyError: 'qwen2_vl' or ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers', try installing the latest version of the transformers library from source:

pip install git+https://github.com/huggingface/transformers accelerate
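To confirm your environment is recent enough, you can check that the required classes import cleanly. This is a minimal sanity check; the card does not pin an exact minimum transformers version, so treat "recent release or source build" as the working assumption:

import transformers
print(transformers.__version__)
# If this import fails, your transformers build predates Qwen2.5-VL support.
from transformers import AutoModelForImageTextToText, AutoProcessor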
Code Example
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Ertugrul/Qwen2.5-VL-7B-Captioner-Relaxed"
image_path = "path/to/your/image.jpg"

# flash_attention_2 requires the flash-attn package and a supported GPU;
# drop the argument to fall back to the default attention implementation.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Bound the visual token count: images are resized so their pixel count
# falls between min_pixels and max_pixels (one visual token per 28x28 patch).
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_id, min_pixels=min_pixels, max_pixels=max_pixels
)

system_message = "You are an expert image describer."

def generate_description(path, model, processor):
    image_inputs = Image.open(path).convert("RGB")
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_message}],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image", "image": image_inputs},
            ],
        },
    ]
    # Render the chat template to a prompt string, then tokenize it
    # together with the image.
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(model.device)
    # Sample a caption; the high temperature is kept coherent by min_p filtering.
    generated_ids = model.generate(
        **inputs, max_new_tokens=512, min_p=0.1, do_sample=True, temperature=1.5
    )
    # Strip the prompt tokens so only the newly generated caption is decoded.
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return output_text[0]

description = generate_description(image_path, model, processor)
print(description)
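The sampling settings above (temperature=1.5 moderated by min_p=0.1) trade repeatability for varied, detailed wording. If you prefer more deterministic captions, greedy decoding is a reasonable swap; this is a sketch, not a setting recommended by the card:

# Inside generate_description, replace the sampling call with greedy decoding:
generated_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)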
✨ Features
- Enhanced Detail: Generates more comprehensive and nuanced image descriptions.
- Relaxed Constraints: Offers less restrictive image descriptions compared to the base model.
- Natural Language Output: Describes different subjects in the image while specifying their locations using natural language.
- Optimized for Image Generation: Produces captions in formats compatible with state-of-the-art text-to-image generation models (see the sketch after this list).
- Improved Base Model: Leverages the advancements of Qwen2.5, potentially leading to better overall performance and understanding.
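To illustrate the text-to-image compatibility point, a generated caption can be passed directly to a diffusion pipeline as a prompt. The snippet below is a hypothetical sketch using Hugging Face diffusers; the pipeline class and checkpoint id are illustrative assumptions, not something this card prescribes:

import torch
from diffusers import StableDiffusionPipeline

# Any text-to-image checkpoint works here; this id is chosen for illustration.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
# `description` is the caption produced by the Quick Start example above.
image = pipe(description).images[0]
image.save("generated.png")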
⚠️ Important Note
This fine-tuned model is optimized for creating text-to-image datasets. As a result, performance on other tasks may be lower than the original model's.
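Since dataset creation is the intended use, a common pattern is to caption a folder of images and write one .txt sidecar per image, the layout many text-to-image trainers expect. A minimal sketch, assuming the generate_description function from the Quick Start and a hypothetical folder path:

from pathlib import Path

image_dir = Path("path/to/images")  # hypothetical dataset folder
for img_path in sorted(image_dir.glob("*.jpg")):
    caption = generate_description(str(img_path), model, processor)
    # Write the caption next to the image: photo.jpg -> photo.txt
    img_path.with_suffix(".txt").write_text(caption, encoding="utf-8")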
📚 Documentation
Qwen2.5-VL-7B-Captioner-Relaxed is an instruction-tuned version of Qwen/Qwen2.5-VL-7B-Instruct, an advanced multimodal large language model. It is an updated version of Ertugrul/Qwen2-VL-7B-Captioner-Relaxed, retrained on the improved Qwen2.5 base model. The fine-tune uses a hand-curated dataset for text-to-image models, yielding significantly more detailed descriptions of given images.
📄 License
This project is licensed under the apache-2.0 license.
Acknowledgements
For more detailed options, refer to the Qwen/Qwen2.5-VL-7B-Instruct documentation.
| Property | Details |
| --- | --- |
| Library Name | transformers |
| Tags | multimodal, qwen |
| License | apache-2.0 |
| Language | en |
| Base Model | Qwen/Qwen2.5-VL-7B-Instruct |
| Pipeline Tag | image-text-to-text |