Qwen2.5 Omni 7B GPTQ 4bit
Developed by FunAGI
A 4-bit GPTQ quantized version of the Qwen2.5-Omni-7B model, supporting multilingual and multimodal tasks.
Downloads 3,957
Release Time: 3/27/2025
Model Overview
This is a 4-bit GPTQ quantized Qwen2.5-Omni-7B model that supports text, image, and video processing, suitable for multilingual and multimodal tasks.
Model Features
4-bit GPTQ Quantization
The model is quantized to 4-bit, significantly reducing memory usage and computational resource requirements.
Multimodal Support
Supports text, image, and video processing for complex multimodal tasks.
Multilingual Support
Supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Efficient Inference
Utilizes flash_attention_2 for efficient inference, improving processing speed.
Model Capabilities
Text Generation
Image Analysis
Video Understanding
Multilingual Processing
Multimodal Reasoning
Use Cases
Content Generation
Video Content Analysis
Analyze video content and generate descriptive text.
Accurately understands video content and generates relevant descriptions.
Language Translation
Multilingual Translation
Translate text from one language to another.
Supports accurate translation across multiple languages.
🚀 Qwen2.5-Omni 4-bit Quantized Model
This project presents a 4-bit quantized Qwen2.5-Omni-7B model, leveraging GPTQModel for quantization. It offers an efficient way to run the model with reduced memory requirements while maintaining performance.
✨ Features
- Quantization: 4-bit quantization of the Qwen2.5-Omni-7B model.
- Multi-language Support: Supports multiple languages including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Model Loading: Provides code examples for both FP and GPTQ model loading.
📦 Installation
Following the official Qwen documentation, install the dependencies:
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers@3a1ead0aabed473eafe527915eea8c197d424356
pip install accelerate
pip install qwen-omni-utils[decord]
Finally, install GPTQModel from its GitHub repository (https://github.com/modelcloud/gptqmodel).
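After installation, a quick sanity check can catch environment problems before the 7B checkpoint is loaded. This is a minimal sketch: it only confirms that the pinned transformers commit exposes the Qwen2.5-Omni classes and that GPTQModel and qwen-omni-utils import cleanly.

# Verify the environment before loading the 7B checkpoint.
import transformers
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor  # provided by the pinned commit above
from gptqmodel import GPTQModel  # installed from the GitHub repository
from qwen_omni_utils import process_mm_info

print("transformers:", transformers.__version__)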
💻 Usage Examples
Basic Usage
import os
import json
import torch
import torch.nn.functional as F
import numpy as np
from PIL import Image
from typing import Any, Dict, List, Optional, Tuple, Union
from transformers import (
    Qwen2_5OmniModel,
    Qwen2_5OmniProcessor,
    AutoModelForVision2Seq,
    AutoProcessor,
    AutoTokenizer
)
from transformers.utils.hub import cached_file
from transformers.generation.utils import GenerateOutput
from gptqmodel import GPTQModel, QuantizeConfig, BACKEND
from gptqmodel.models.base import BaseGPTQModel
from gptqmodel.models.auto import MODEL_MAP, SUPPORTED_MODELS
from gptqmodel.models._const import CPU
from gptqmodel.utils.model import move_to  # used by the hooks below; adjust the import path if your GPTQModel version places this helper elsewhere
from datasets import load_dataset
from qwen_omni_utils import process_mm_info
# Register Qwen2.5-Omni with GPTQModel: only the "thinker" decoder layers are quantized,
# while the vision/audio towers, talker, and token2wav modules stay in full precision.
class Qwen25OmniThinkerGPTQ(BaseGPTQModel):
    loader = Qwen2_5OmniModel
    base_modules = [
        "thinker.model.embed_tokens",
        "thinker.model.norm",
        "token2wav",
        "thinker.audio_tower",
        "thinker.model.rotary_emb",
        "thinker.visual",
        "talker"
    ]
    pre_lm_head_norm_module = "thinker.model.norm"
    require_monkeypatch = False
    layers_node = "thinker.model.layers"
    layer_type = "Qwen2_5OmniDecoderLayer"
    layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.o_proj"],
        ["mlp.up_proj", "mlp.gate_proj"],
        ["mlp.down_proj"],
    ]

    def pre_quantize_generate_hook_start(self):
        # Move the multimodal encoders onto the quantization device for calibration.
        self.thinker.visual = move_to(self.thinker.visual, device=self.quantize_config.device)
        self.thinker.audio_tower = move_to(self.thinker.audio_tower, device=self.quantize_config.device)

    def pre_quantize_generate_hook_end(self):
        # Move them back to CPU afterwards to free GPU memory.
        self.thinker.visual = move_to(self.thinker.visual, device=CPU)
        self.thinker.audio_tower = move_to(self.thinker.audio_tower, device=CPU)

    def preprocess_dataset(self, sample: Dict) -> Dict:
        return sample

MODEL_MAP["qwen2_5_omni"] = Qwen25OmniThinkerGPTQ
SUPPORTED_MODELS.append("qwen2_5_omni")
model_path = "/home/chentianqi/model/Qwen/Qwen2.5-Omni-7B-GPTQ-4bit"  # local path to the quantized checkpoint; replace with your own download location
from types import MethodType

# Patch Qwen2_5OmniModel.from_config so that the speaker dictionary (spk_dict.pt)
# is resolved and loaded when GPTQModel reconstructs the model from its config.
@classmethod
def patched_from_config(cls, config, *args, **kwargs):
    kwargs.pop("trust_remote_code", None)
    model = cls._from_config(config, **kwargs)
    spk_path = cached_file(
        model_path,
        "spk_dict.pt",
        subfolder=kwargs.pop("subfolder", None),
        cache_dir=kwargs.pop("cache_dir", None),
        force_download=kwargs.pop("force_download", False),
        proxies=kwargs.pop("proxies", None),
        resume_download=kwargs.pop("resume_download", None),
        local_files_only=kwargs.pop("local_files_only", False),
        token=kwargs.pop("use_auth_token", None),
        revision=kwargs.pop("revision", None),
    )
    if spk_path is None:
        raise ValueError(f"Speaker dictionary spk_dict.pt not found in {model_path}")
    model.load_speakers(spk_path)
    return model

Qwen2_5OmniModel.from_config = patched_from_config
# FP (unquantized) model, for comparison:
# model = Qwen2_5OmniModel.from_pretrained(
#     model_path,
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
#     attn_implementation="flash_attention_2",
# )

# GPTQ model
model = GPTQModel.load(
    model_path,
    device_map="cuda",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2"
)
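The BACKEND enum imported above is unused in this snippet. If you prefer to pin a specific dequantization kernel rather than let GPTQModel auto-select one, it can be passed at load time. The sketch below is a hedged example that assumes your GPTQModel version accepts a backend keyword on GPTQModel.load.

# Optional: pin the kernel backend explicitly (assumes `backend` is accepted by GPTQModel.load).
model = GPTQModel.load(
    model_path,
    device_map="cuda",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    backend=BACKEND.AUTO,  # e.g. BACKEND.TRITON or BACKEND.MARLIN if the corresponding kernels are installed
)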
Advanced Usage
from qwen_omni_utils import process_mm_info

processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

# @title inference function
def inference(video_path, prompt, sys_prompt):
    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "video", "video": video_path},
            ]
        },
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # image_inputs, video_inputs = process_vision_info([messages])
    audios, images, videos = process_mm_info(messages, use_audio_in_video=False)
    inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True)
    inputs = inputs.to(model.device).to(model.dtype)
    output = model.generate(**inputs, use_audio_in_video=False, return_audio=False)
    text = processor.batch_decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    return text
video_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/screen.mp4"
prompt = "Please translate the abstract of the paper into Chinese."

# display(Video(video_path, width=640, height=360))

# Use the local Hugging Face model for inference.
response = inference(video_path, prompt=prompt, sys_prompt="You are a helpful assistant.")
print(response[0])
📚 Documentation
Model Information
| Property | Details |
|---|---|
| Model Type | 4-bit quantized Qwen2.5-Omni-7B model |
| Base Model | Qwen/Qwen2.5-Omni-7B |
| Pipeline Tag | any-to-any |
| Tags | gptqmodel, FunAGI, Qwen, int4 |
Quantization Parameters
- bits: 4
- dynamic: null
- group_size: 128
- desc_act: true
- static_groups: false
- sym: false
- lm_head: false
- true_sequential: true
- quant_method: "gptq"
- checkpoint_format: "gptq"
- meta:
  - quantizer: gptqmodel:1.1.0
  - uri: https://github.com/modelcloud/gptqmodel
  - damp_percent: 0.1
  - damp_auto_increment: 0.0015
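For reference, these settings map onto a GPTQModel QuantizeConfig (already imported in the Basic Usage snippet). The sketch below assumes the listed keys correspond to keyword arguments of the same name; only the commonly documented ones are shown.

quant_config = QuantizeConfig(
    bits=4,            # 4-bit weights
    group_size=128,    # quantization group size
    desc_act=True,     # activation-order ("desc_act") quantization
    sym=False,         # asymmetric quantization
    damp_percent=0.1,  # Hessian dampening used by GPTQ
)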
Model Size

| | FP | 4-bit |
|---|---|---|
| Model size | 22.39G | 12.71G |
🔧 Technical Details
This model is a 4-bit quantized version of Qwen2.5-Omni-7B. Quantization is performed with the GPTQModel library, which enables efficient storage and inference of large language models. The quantization parameters are tuned to balance model size against performance.
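For illustration only, an end-to-end quantization pass with GPTQModel would look roughly like the sketch below. It relies on the Qwen25OmniThinkerGPTQ registration from the Basic Usage snippet and the quant_config sketched after the parameter list; the calibration texts are hypothetical placeholders, not the authors' actual calibration data.

# Hypothetical quantization sketch; the real calibration set and pipeline may differ.
calibration_data = ["Qwen2.5-Omni is an end-to-end multimodal model."] * 256   # placeholder texts
fp_model = GPTQModel.load("Qwen/Qwen2.5-Omni-7B", quant_config)  # FP base model with the 4-bit recipe attached
fp_model.quantize(calibration_data)                              # run GPTQ over the thinker decoder layers
fp_model.save("./Qwen2.5-Omni-7B-GPTQ-4bit")                     # write quantized weights and quantize_config.json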
📄 License
This project is licensed under the MIT License.