Q-SiT Mini Open-Source Image Quality Assessment Model - Deploy for Free to Analyze and Score Image Quality


Q-SiT

Developed by zhangzicheng

Q-SiT Mini is a lightweight image quality assessment and dialogue model focused on analyzing and scoring image quality.

Image-to-Text

Transformers

License: MIT #image-quality-assessment #multimodal-dialogue #zero-shot-scoring


Release date: 3/11/2025

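Model Overview

The model assesses image quality and outputs a score, and it also supports conversational interaction about image quality. It is a lightweight version of the Q-SiT model, intended for resource-constrained environments.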

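Model Features

Lightweight design

Lighter than the standard Q-SiT model, so the Mini version suits resource-constrained environments

Image quality scoring

Rates image quality on a five-level scale (Excellent to Bad) and computes a weighted average score

Quality dialogue

Supports conversational interaction about image quality, such as answering questions about image clarity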
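Assuming the weighted average follows the `wa5` function in the advanced usage example, it can be written as below, where ℓ_i is the logit of the i-th rating token in the order Excellent, Good, Fair, Poor, Bad. The softmax runs over these five tokens only, so s lies in [0, 1], and 5s gives the 0-5 scale.

```latex
p_i = \frac{\exp(\ell_i)}{\sum_{j=1}^{5} \exp(\ell_j)}, \qquad
s = \sum_{i=1}^{5} w_i \, p_i, \qquad
(w_1, w_2, w_3, w_4, w_5) = (1,\ 0.75,\ 0.5,\ 0.25,\ 0)
```

For example, with probabilities (0.1, 0.2, 0.4, 0.2, 0.1), s = 0.1 + 0.15 + 0.2 + 0.05 + 0 = 0.5, or 2.5 on the 0-5 scale.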

Model Capabilities

Image quality assessment

Image quality dialogue

Image quality scoring

Image-to-text generation

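Use Cases

Image quality analysis

Image clarity assessment

Assess how clear the people in an image are

Outputs a clarity rating

Image quality scoring

Rate image quality on a five-level scale

Outputs a score in the range 0-1 or 0-5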

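🚀 Image Quality Assessment Model (Q-SiT)

This project provides Q-SiT, an image quality assessment model built on Hugging Face Transformers that can interpret and score image quality.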

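🚀 Quick Start

You do not need to install the GitHub repository; just use version 4.45.0 of the Transformers package (pip install transformers==4.45.0).

📦 Installation

Install the pinned version of the Transformers package: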

pip install transformers==4.45.0
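A minimal sketch for confirming the pinned version before running the examples. It is not part of the original instructions: it uses only the standard library's `importlib.metadata`, and the helper name `transformers_version_ok` is hypothetical. It treats anything other than an exact 4.45.0 match as a mismatch, which may be stricter than necessary.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pinned by the installation instructions above
REQUIRED_TRANSFORMERS = "4.45.0"


def transformers_version_ok(required: str = REQUIRED_TRANSFORMERS) -> bool:
    """Return True only if the installed transformers version equals `required`."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        # transformers is not installed in this environment
        return False
    return installed == required


if __name__ == "__main__":
    if transformers_version_ok():
        print(f"transformers {REQUIRED_TRANSFORMERS} is installed")
    else:
        print(f"Run: pip install transformers=={REQUIRED_TRANSFORMERS}")
```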

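💻 Usage Examples

Basic Usage

Image quality interpretation through dialogue. Both examples load the model in float16 onto the first GPU (device 0):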

import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "zhangzicheng/q-sit-mini"
# To use the full (non-Mini) version instead, switch to q-sit:
# model_id = "zhangzicheng/q-sit"

# Load the model in half precision and move it to GPU 0
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

# A single user turn: the question text followed by an image placeholder
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How is the clarity of the human in this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Download the example image
raw_image = Image.open(requests.get("https://github.com/Q-Future/Q-SiT/blob/main/44009500.jpg?raw=true", stream=True).raw)

# Preprocess image and prompt; tensors move to GPU 0 and floating-point ones are cast to float16
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

# Greedy decoding; print only the text after the last "assistant" role marker
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True).split("assistant")[-1])
# very low
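To ask other questions about the same image, the conversation layout from the basic usage example can be wrapped in a small helper. This is a sketch: `build_quality_conversation` is a hypothetical name, and the alternative question is illustrative only, since this card does not document how the model handles other phrasings.

```python
from typing import Any, Dict, List


def build_quality_conversation(question: str) -> List[Dict[str, Any]]:
    """Return a single-turn conversation in the same layout as the example above:
    one user message with the question text followed by an image placeholder."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                # Placeholder only; the actual pixels are passed via processor(images=...)
                {"type": "image"},
            ],
        },
    ]


# Illustrative question; substitute any image-quality question
conversation = build_quality_conversation("Is there visible noise in this image?")
print(conversation[0]["content"][0]["text"])
```

The returned list can be passed to `processor.apply_chat_template(conversation, add_generation_prompt=True)` in place of the hand-written `conversation`.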

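Advanced Usage

Image quality scoring. The prompt is extended with the fixed prefix "The quality of this image is ", exactly one token is generated, and the logits of the five rating tokens are turned into a weighted average score: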

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration, AutoTokenizer
import numpy as np

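def wa5(logits):
    # Softmax over the five rating-token logits only, then a weighted average
    # with weights Excellent=1, Good=0.75, Fair=0.5, Poor=0.25, Bad=0
    logit_arr = np.array([logits["Excellent"], logits["Good"], logits["Fair"], logits["Poor"], logits["Bad"]])
    exps = np.exp(logit_arr - logit_arr.max())  # subtract the max for numerical stability
    probs = exps / exps.sum()
    return np.inner(probs, np.array([1, 0.75, 0.5, 0.25, 0]))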

model_id = "zhangzicheng/q-sit-mini"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

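# Rating tokens, from best to worst
toks = ["Excellent", "Good", "Fair", "Poor", "Bad"]
# Keep the first token ID of each rating word (if a word splits into several sub-tokens, only the first is used)
ids_ = [id_[0] for id_ in tokenizer(toks)["input_ids"]]
print("Rating token IDs:", ids_)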

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Assume you are an image quality evaluator. Your rating should be chosen from the following five categories: Excellent, Good, Fair, Poor, and Bad (from high to low). How would you rate the quality of this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Load the example image and preprocess it together with the prompt
raw_image = Image.open(requests.get("https://github.com/Q-Future/Q-SiT/blob/main/44009500.jpg?raw=true", stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

# Manually append the assistant prefix "The quality of this image is "
prefix_text = "The quality of this image is "
prefix_ids = tokenizer(prefix_text, return_tensors="pt")["input_ids"].to(0)
inputs["input_ids"] = torch.cat([inputs["input_ids"], prefix_ids], dim=-1)
inputs["attention_mask"] = torch.ones_like(inputs["input_ids"])  # Update attention mask

# Generate exactly one token (the rating)
output = model.generate(
    **inputs,
    max_new_tokens=1,  # Generate only the rating token
    output_logits=True,
    return_dict_in_generate=True,
)

# Extract logits for the generated rating token
last_logits = output.logits[-1][0]  # Shape: [vocab_size]
logits_dict = {tok: last_logits[id_].item() for tok, id_ in zip(toks, ids_)}
weighted_score = wa5(logits_dict)
print("Weighted average score:", weighted_score)
# Example output: Weighted average score: 0.045549712192942585 (range 0-1)
# For a 0-5 scale, multiply the score by 5
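A dependency-free sketch of the same scoring step, useful for post-processing saved logits without NumPy or a GPU. It assumes the five levels and weights from `wa5`, computes the softmax with the maximum logit subtracted first (mathematically the same softmax), and returns both the 0-1 and 0-5 scores. The function names are hypothetical, and the logits in the usage example are made up for illustration.

```python
import math
from typing import Dict, Tuple

# Rating levels and weights, taken from the wa5 function above
LEVELS = ("Excellent", "Good", "Fair", "Poor", "Bad")
WEIGHTS = (1.0, 0.75, 0.5, 0.25, 0.0)


def rating_probabilities(logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the five rating logits only (max-subtracted for stability)."""
    values = [logits[level] for level in LEVELS]
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return {level: e / total for level, e in zip(LEVELS, exps)}


def quality_scores(logits: Dict[str, float]) -> Tuple[float, float]:
    """Return (score on the 0-1 scale, score on the 0-5 scale)."""
    probs = rating_probabilities(logits)
    score = sum(probs[level] * w for level, w in zip(LEVELS, WEIGHTS))
    return score, score * 5


if __name__ == "__main__":
    # Made-up logits; in practice pass logits_dict from the example above
    example = {"Excellent": 0.0, "Good": 0.0, "Fair": 0.0, "Poor": 0.0, "Bad": 0.0}
    print(quality_scores(example))  # roughly (0.5, 2.5) for equal logits
```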