Qwen2.5-VL-7B-Instruct_gemlite-ao_a8w8 Open-source Multimodal Model - Free Deployment, Supports Vision/Language Tasks

Developed by mobiuslabsgmbh

This is an A8W8-quantized (8-bit activations, 8-bit weights) version of the multimodal large language model Qwen2.5-VL-7B-Instruct. It supports vision and language tasks.

Image-to-Text

Transformers

Open Source License: Apache-2.0

Tags: #Multimodal Vision-Language, #Efficient A8W8 Quantization, #Image Caption Generation

Downloads 161

Release Time: 6/4/2025

Model Overview

This model is a quantized version of Qwen2.5-VL-7B-Instruct, using TorchAO and GemLite as backends, suitable for vision-language understanding and generation tasks.

Model Features

A8W8 Quantization

The model is quantized to 8-bit activations and 8-bit weights (A8W8), which reduces memory usage and compute requirements compared with the unquantized model.
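To make the A8W8 idea concrete, the sketch below quantizes a weight vector and an activation vector to signed 8-bit integers, computes their dot product in integer arithmetic, and rescales the result to floating point. It assumes symmetric absmax scaling with one scale per vector. That scheme and the helper names `quantize_int8` and `int8_dot` are illustrative assumptions, not a description of the actual TorchAO or GemLite kernels.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of a list of floats to the int8 range.

    Returns the quantized integers and the scale needed to dequantize them.
    """
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero vector, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(weights, activations):
    """Dot product with both operands quantized to 8 bits (A8W8-style)."""
    q_w, scale_w = quantize_int8(weights)
    q_a, scale_a = quantize_int8(activations)
    # Accumulate in integers, then rescale once at the end.
    acc = sum(w * a for w, a in zip(q_w, q_a))
    return acc * scale_w * scale_a


weights = [0.12, -0.5, 0.33, 0.9]
activations = [1.5, -2.0, 0.25, 0.75]

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The quantized result differs slightly from the exact one; that rounding error is the price of storing each quantized weight in one byte instead of the two bytes a float16 value needs.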

Multimodal Support

Processes image and text inputs together for vision-language understanding.

Efficient Inference

Uses the TorchAO and GemLite backends to speed up inference.

Model Capabilities

Image Caption Generation

Visual Question Answering

Multimodal Dialogue

Text Generation

Use Cases

Content Understanding

Image Captioning

Generates natural-language descriptions of input images.

The output is text that describes the image content.

Intelligent Assistant

Multimodal Dialogue

Holds conversations that combine images and text.

The model can interpret image content and answer questions about it.

🚀 Qwen2.5-VL-7B-Instruct A8W8 Quantized Model

This is an A8W8 quantized Qwen2.5-VL-7B-Instruct model, leveraging TorchAO and GemLite as the backend.

🚀 Quick Start

📦 Installation

First, install the dependencies:

pip install torchao
pip install git+https://github.com/mobiusml/gemlite.git
pip install "qwen-vl-utils[decord]==0.0.8"
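After installation, a quick check can confirm that the packages are importable before loading the model. The sketch below is a minimal example using only the standard library. The helper name `check_dependencies` is hypothetical, and the import names `torchao` and `gemlite` are assumed to match the package names (`qwen_vl_utils` is the import name used in the usage example).

```python
import importlib.util

# Import names assumed for the packages installed above, plus torch itself.
REQUIRED_MODULES = ("torch", "torchao", "gemlite", "qwen_vl_utils")


def check_dependencies(modules=REQUIRED_MODULES):
    """Return a mapping of module name to whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


for name, found in check_dependencies().items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```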

💻 Usage Examples

Basic Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_id = "mobiuslabsgmbh/Qwen2.5-VL-7B-Instruct_gemlite-ao_a8w8"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda",
    #attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens from each sequence so that only newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
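The `messages` structure above can be reused for other tasks such as visual question answering by changing the images and the text prompt. The helper below is a small sketch of that pattern. The name `build_messages` is hypothetical and not part of `transformers` or `qwen_vl_utils`; it only assembles the same dictionary layout shown in the example.

```python
def build_messages(image_sources, question):
    """Build a single-turn user message with zero or more images and one text prompt.

    image_sources: list of image URLs or local paths, in the order they should appear.
    question: the text instruction or question about the images.
    """
    if not question:
        raise ValueError("question must be a non-empty string")
    content = [{"type": "image", "image": src} for src in image_sources]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_messages(
    ["https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"],
    "What is happening in this image?",
)
print(messages)
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in place of the hand-written `messages`.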

Advanced Usage

import torch
from vllm import LLM
from vllm.sampling_params import SamplingParams

model_id = "mobiuslabsgmbh/Qwen2.5-VL-7B-Instruct_gemlite-ao_a8w8"
processor_args = {
    # Allow at most 3 images per prompt
    "limit_mm_per_prompt": {"image": 3},
    # Bounds on the image resolution passed to the vision processor
    "mm_processor_kwargs": {"min_pixels": 28 * 28, "max_pixels": 1280 * 28 * 28},
    "disable_mm_preprocessor_cache": False,
}

llm = LLM(
    model=model_id,
    gpu_memory_utilization=0.9,
    dtype=torch.float16,
    max_model_len=4096,
    max_num_batched_tokens=4096,
    **processor_args,
)

sampling_params = SamplingParams(
    max_tokens=1024, temperature=0.5, repetition_penalty=1.1, ignore_eos=False
)

messages = [
    {"content": "You are a helpful assistant", "role": "system"},
    {"content": "Solve this equation x^2 + 1 = -1.", "role": "user"},
]
outputs = llm.chat(messages, sampling_params, chat_template=llm.get_tokenizer().chat_template)
print(outputs[0].outputs[0].text)
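The example above sends a text-only prompt, although `limit_mm_per_prompt` allows up to three images. The sketch below shows one way an image prompt could be built for `llm.chat` using OpenAI-style `image_url` content parts, and it estimates the visual token budget implied by `max_pixels`. The helper names `build_vllm_image_messages` and `max_visual_tokens` are hypothetical. The estimate assumes roughly one visual token per 28×28 pixel block, as the `28 * 28` factors in the settings suggest, so treat it as a rough figure rather than an exact count.

```python
# Assumption: one visual token corresponds to one 28x28 pixel block.
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def max_visual_tokens(max_pixels, images_per_prompt):
    """Rough upper bound on visual tokens for a prompt with several images."""
    return (max_pixels // PIXELS_PER_VISUAL_TOKEN) * images_per_prompt


def build_vllm_image_messages(image_urls, question, image_limit=3):
    """Build a chat message with OpenAI-style image_url parts.

    image_limit mirrors the limit_mm_per_prompt setting used above.
    """
    if len(image_urls) > image_limit:
        raise ValueError(f"at most {image_limit} images are allowed per prompt")
    content = [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


budget = max_visual_tokens(max_pixels=1280 * 28 * 28, images_per_prompt=3)
print(budget)  # 3840

messages = build_vllm_image_messages(
    ["https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"],
    "Describe this image.",
)
```

Under that assumption, three maximum-size images would take about 3840 of the 4096 tokens allowed by `max_model_len`, so a smaller `max_pixels`, fewer images, or a larger `max_model_len` may be needed when long answers are expected. The resulting `messages` list can be passed to `llm.chat(messages, sampling_params)` in the same way as the text-only example.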

📄 License

This project is licensed under the Apache-2.0 license.
