q-sit-mini: Open-Source Image Quality Scoring and Interpreting System - Free Image Quality Assessment and Interpretation

Q Sit Mini

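Developed by zhangzicheng

Q-SiT is an image quality scoring and interpreting system built on a large language model. It performs image quality assessment and interpretation at the same time.

Image-to-Text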

Transformers

License: MIT #ImageQualityScoring #VisualInterpretationDialogue #UnifiedMultiTaskFramework

Downloads: 371

Released: 3/11/2025

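Model Overview

Q-SiT uses a large language model to perform image quality scoring and interpreting simultaneously. It builds on the intrinsic link between perception and decision-making in the human visual system and offers a single unified framework for both tasks.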

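Model Features

Unified scoring and interpreting

Combines image quality scoring and interpreting in a single model, so one model covers the whole assessment workflow.

Five-level rating system

Rates quality on five levels, from Excellent to Bad. The rating can be converted to a numeric score in the range 0-1 or 0-5.

Vision-language understanding

Combines visual feature extraction with the understanding capability of a large language model, aiming at more accurate image quality analysis.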

Model Capabilities

Image quality scoring

Image quality interpreting

Vision-language understanding

Image feature analysis

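Use Cases

Image quality assessment

Photography quality assessment

Assess quality indicators of photographs, such as sharpness and noise.

Provides a five-level rating and a detailed quality analysis.

Surveillance video quality inspection

Check the image quality of key frames in surveillance video.

Identifies quality problems such as blur and low light.

Image processing

Image enhancement evaluation

Assess how quality changes before and after an image enhancement algorithm is applied.

Provides a quantitative score and an analysis of the quality improvement.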

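🚀 Q-SiT: Image Quality Scoring and Interpreting with Large Language Models

Q-SiT is a model for image quality scoring and interpreting. It uses a large language model to perform both tasks at once, reflecting the intrinsic link between perception and decision-making in the human visual system. Unlike earlier methods that treat scoring and interpreting as separate tasks, Q-SiT provides a unified framework.

Project page: https://github.com/Q-Future/Q-SiT

🚀 Quick Start

There is no need to install this GitHub repository. Make sure the installed Transformers version is 4.45.0 (pip install transformers==4.45.0).

💻 Usage Examples

Basic usage

Image quality interpreting chat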
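The version requirement above can be checked before loading the model. The snippet below is a minimal sketch using only the standard library; the helper name `check_transformers` is hypothetical and not part of Q-SiT or Transformers.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = "4.45.0"  # version named in the Quick Start text


def check_transformers(required: str = REQUIRED) -> str:
    """Return a short status message about the installed transformers version."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return f"transformers is not installed; run: pip install transformers=={required}"
    if installed != required:
        return f"transformers {installed} found, but {required} is expected"
    return f"transformers {installed} OK"


print(check_transformers())
```

The check is an exact string comparison, which matches the pinned install command; other versions may still work, but the source only names 4.45.0.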

import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "zhangzicheng/q-sit-mini"
# To use the primary (full) version instead, switch model_id to q-sit:
# model_id = "zhangzicheng/q-sit"

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How is the clarity of the human in this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

raw_image = Image.open(requests.get("https://github.com/Q-Future/Q-SiT/blob/main/44009500.jpg?raw=true",stream=True).raw)

inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True).split("assistant")[-1])
# very low
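The last line of the example above isolates the model's answer by splitting the decoded text on the word "assistant". The sketch below shows that step on its own; the function name `extract_reply` and the decoded string are hypothetical and only illustrate the shape of the text.

```python
def extract_reply(decoded: str, marker: str = "assistant") -> str:
    """Return the text after the last occurrence of `marker`, stripped.

    If the marker is absent, the whole string is returned (stripped).
    """
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Hypothetical decoded output, made up for illustration.
decoded = "user \nHow is the clarity of the human in this image?assistant\nvery low"
print(extract_reply(decoded))
```

A string split like this truncates the answer if the reply itself contains the word "assistant". A common alternative, assuming `generate` returns the prompt tokens followed by the new tokens, is to decode only `output[0][inputs["input_ids"].shape[1]:]`.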

Advanced usage

Image quality scoring

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration, AutoTokenizer
import numpy as np

def wa5(logits):
    logprobs = np.array([logits["Excellent"], logits["Good"], logits["Fair"], logits["Poor"], logits["Bad"]])
    probs = np.exp(logprobs) / np.sum(np.exp(logprobs))
    return np.inner(probs, np.array([1, 0.75, 0.5, 0.25, 0]))

model_id = "zhangzicheng/q-sit-mini"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Define rating tokens
toks = ["Excellent", "Good", "Fair", "Poor", "Bad"]
ids_ = [id_[0] for id_ in tokenizer(toks)["input_ids"]]
print("Rating token IDs:", ids_)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Assume you are an image quality evaluator. Your rating should be chosen from the following five categories: Excellent, Good, Fair, Poor, and Bad (from high to low). How would you rate the quality of this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Load image
raw_image = Image.open(requests.get("https://github.com/Q-Future/Q-SiT/blob/main/44009500.jpg?raw=true",stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

# Manually append the assistant prefix "The quality of this image is "
prefix_text = "The quality of this image is "
prefix_ids = tokenizer(prefix_text, return_tensors="pt")["input_ids"].to(0)
inputs["input_ids"] = torch.cat([inputs["input_ids"], prefix_ids], dim=-1)
inputs["attention_mask"] = torch.ones_like(inputs["input_ids"])  # Update attention mask

# Generate exactly one token (the rating)
output = model.generate(
    **inputs,
    max_new_tokens=1,  # Generate only the rating token
    output_logits=True,
    return_dict_in_generate=True,
)

# Extract logits for the generated rating token
last_logits = output.logits[-1][0]  # Shape: [vocab_size]
logits_dict = {tok: last_logits[id_].item() for tok, id_ in zip(toks, ids_)}
weighted_score = wa5(logits_dict)
print("Weighted average score:", weighted_score)
# Example output: Weighted average score: 0.045549712192942585 (range 0-1)
# For a 0-5 scale, multiply the score by 5
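The weighted average used in `wa5` can be tried without loading the model. The sketch below repeats the same idea in pure Python: a softmax over the five rating logits, then an inner product with the weights 1, 0.75, 0.5, 0.25, 0. The function `summarize`, the max-subtraction for numerical stability, and the example logits are assumptions added for illustration, not taken from the Q-SiT code.

```python
import math

LEVELS = ["Excellent", "Good", "Fair", "Poor", "Bad"]
WEIGHTS = [1.0, 0.75, 0.5, 0.25, 0.0]  # same weights as wa5


def softmax(values):
    """Softmax with max-subtraction so large logits do not overflow."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(logits):
    """Turn five rating logits into a 0-1 score, a 0-5 score, and the top level."""
    probs = softmax([logits[level] for level in LEVELS])
    score01 = sum(p * w for p, w in zip(probs, WEIGHTS))
    top = LEVELS[max(range(len(LEVELS)), key=probs.__getitem__)]
    return {"score_0_1": score01, "score_0_5": score01 * 5, "top_level": top}


# Made-up logits, for illustration only.
example = {"Excellent": -2.0, "Good": -1.0, "Fair": 0.5, "Poor": 3.0, "Bad": 6.0}
print(summarize(example))
```

Because the score is an expectation over all five levels, it moves smoothly between levels instead of jumping when the most likely label changes.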

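For dataset evaluation scripts, see the corresponding directory in the GitHub repository. For training information, see the Training Q-SiT section of the GitHub repository.

📄 License

This project is released under the MIT License.

📚 Citation

If you find our work useful, please cite our paper as follows: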

@misc{zhang2025teachinglmmsimagequality,
      title={Teaching LMMs for Image Quality Scoring and Interpreting}, 
      author={Zicheng Zhang and Haoning Wu and Ziheng Jia and Weisi Lin and Guangtao Zhai},
      year={2025},
      eprint={2503.09197},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.09197}, 
}