🚀 Qwen2-VL-7B-Captioner-Relaxed
Qwen2-VL-7B-Captioner-Relaxed is an instruction-tuned multimodal large language model that offers more detailed image descriptions.
✨ Features
- Enhanced Detail: Generates more comprehensive and nuanced image descriptions.
- Relaxed Constraints: Offers less restrictive image descriptions compared to the base model.
- Natural Language Output: Describes different subjects in the image while specifying their locations using natural language.
- Optimized for Image Generation: Produces captions in formats compatible with state-of-the-art text-to-image generation models.
⚠️ Important Note
This fine-tuned model is optimized for creating text-to-image datasets. As a result, its performance on other tasks may be lower than the original model's (e.g., a ~10% decrease on mmmu_val).
📦 Installation
If you encounter errors such as KeyError: 'qwen2_vl' or ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers', try installing the latest version of the transformers library from source:
pip install git+https://github.com/huggingface/transformers
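Before reinstalling, it can help to confirm whether the installed transformers build already exposes the Qwen2-VL class. The following is a minimal sketch; the helper name qwen2_vl_available is hypothetical and not part of transformers, and the check only looks for the class name, so it does not guarantee that the model will load.

```python
import importlib


def qwen2_vl_available() -> bool:
    """Return True if `transformers` imports and exposes the Qwen2-VL model class.

    Hypothetical helper for illustration; it only checks that the name exists.
    """
    try:
        transformers = importlib.import_module("transformers")
        return hasattr(transformers, "Qwen2VLForConditionalGeneration")
    except Exception:
        # transformers is missing or failed to import in this environment
        return False


if __name__ == "__main__":
    if qwen2_vl_available():
        print("Qwen2VLForConditionalGeneration is available.")
    else:
        print("Qwen2-VL support not found; consider installing transformers from source.")
```

If the check reports that support is missing, the pip command above is the fix suggested by this README.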
🚀 Quick Start
💻 Usage Examples
Basic Usage
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
# Unused below; only needed if you pass a quantization_config to from_pretrained.
from transformers import BitsAndBytesConfig
import torch

model_id = "Ertugrul/Qwen2-VL-7B-Captioner-Relaxed"

# Load the model in bfloat16 and let device_map="auto" place it on the available device(s).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A single user turn containing one image placeholder and the instruction text.
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

image = Image.open(r"PATH_TO_YOUR_IMAGE")

# Render the chat template to a prompt string, then tokenize text and image together.
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output_ids = model.generate(
            **inputs,
            max_new_tokens=384,
            do_sample=True,
            temperature=0.7,
            use_cache=True,
            top_k=50,
        )

# Drop the prompt tokens so that only the newly generated caption is decoded.
generated_ids = [
    out_ids[len(input_ids):]
    for input_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)[0]
print(output_text)
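Since the model is aimed at building text-to-image datasets, a common next step is to caption a whole folder and store each caption next to its image. The sketch below assumes a sidecar layout (image.png alongside image.txt), which many training tools accept but which this README does not prescribe. The names write_sidecar_captions, IMAGE_SUFFIXES, and caption_fn are hypothetical; caption_fn stands for any callable that takes an image path and returns a caption string, for example a function wrapping the Basic Usage code above.

```python
from pathlib import Path
from typing import Callable, List

# Assumed set of image extensions to caption; adjust for your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def write_sidecar_captions(
    folder: Path,
    caption_fn: Callable[[Path], str],
    overwrite: bool = False,
) -> List[Path]:
    """Write one .txt caption file next to each image in `folder`.

    Returns the list of caption files that were written.
    """
    written: List[Path] = []
    for image_path in sorted(Path(folder).iterdir()):
        if not image_path.is_file():
            continue
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        caption_path = image_path.with_suffix(".txt")
        # Skip images that already have a caption unless overwriting was requested.
        if caption_path.exists() and not overwrite:
            continue
        caption = caption_fn(image_path).strip()
        caption_path.write_text(caption + "\n", encoding="utf-8")
        written.append(caption_path)
    return written
```

Keeping the captioning logic behind caption_fn means the folder walk can be tried with a stub function before the 7B model is loaded, and existing captions are left untouched unless overwrite=True is passed.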
Gradio UI
If you prefer a no-code option, there is a simple GUI that lets you caption selected images. You can find out more about it here:
qwen2vl-captioner-gui
📄 License
This project is licensed under the Apache 2.0 license.
Acknowledgements
- The Google AI/ML Developer Programs team supported this work by providing Google Cloud credits.
For more detailed options, refer to the Qwen2-VL-7B-Instruct documentation.