🚀 Q-SiT: Image Quality Scoring and Interpreting with Large Language Models
Q-SiT is a model for image quality scoring and interpreting. It uses a large language model to perform both tasks jointly, reflecting the intrinsic link between perception and decision-making in the human visual system. Unlike previous approaches that treat scoring and interpreting as separate tasks, Q-SiT provides a unified framework.
Project page: https://github.com/Q-Future/Q-SiT
🚀 Quick Start
There is no need to install this GitHub repository. Just make sure you are using Transformers version 4.45.0 (pip install transformers==4.45.0).
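If you want a quick sanity check of the environment before running the examples, something along these lines works (a minimal sketch; the version pin is taken from the requirement above):

import transformers

# The examples below are written against Transformers 4.45.0; warn if the installed version differs.
if transformers.__version__ != "4.45.0":
    print(f"Warning: found transformers {transformers.__version__}, expected 4.45.0")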
💻 Usage Examples
Basic Usage
Image Quality Interpreting Chat
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "zhangzicheng/q-sit-mini"

# Load the Q-SiT model in half precision on GPU 0
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

# Build a single-turn conversation with one image and a quality-related question
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How is the clarity of the human in this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Download the example image and run greedy generation
raw_image = Image.open(requests.get("https://github.com/Q-Future/Q-SiT/blob/main/44009500.jpg?raw=true", stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode and keep only the assistant's reply
print(processor.decode(output[0][2:], skip_special_tokens=True).split("assistant")[-1])
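The same pipeline also works with local files and other interpreting-style questions. A small variation that reuses the model and processor loaded above (the file name my_photo.jpg and the question are placeholders, not part of the official example):

# Hypothetical variation: ask about a different aspect of a local image.
local_image = Image.open("my_photo.jpg")  # placeholder path, replace with your own image
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How would you describe the sharpness of the background in this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=local_image, text=prompt, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True).split("assistant")[-1])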
Advanced Usage
Image Quality Scoring
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration, AutoTokenizer
import numpy as np

def wa5(logits):
    # Softmax over the five rating-token logits, then a weighted average with
    # weights Excellent=1, Good=0.75, Fair=0.5, Poor=0.25, Bad=0
    logprobs = np.array([logits["Excellent"], logits["Good"], logits["Fair"], logits["Poor"], logits["Bad"]])
    probs = np.exp(logprobs) / np.sum(np.exp(logprobs))
    return np.inner(probs, np.array([1, 0.75, 0.5, 0.25, 0]))

model_id = "zhangzicheng/q-sit-mini"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Token IDs of the five rating words
toks = ["Excellent", "Good", "Fair", "Poor", "Bad"]
ids_ = [id_[0] for id_ in tokenizer(toks)["input_ids"]]
print("Rating token IDs:", ids_)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Assume you are an image quality evaluator. Your rating should be chosen from the following five categories: Excellent, Good, Fair, Poor, and Bad (from high to low). How would you rate the quality of this image?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
raw_image = Image.open(requests.get("https://github.com/Q-Future/Q-SiT/blob/main/44009500.jpg?raw=true", stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, torch.float16)

# Append the answer prefix "The quality of this image is " so that the next
# generated token is the rating word itself
prefix_text = "The quality of this image is "
prefix_ids = tokenizer(prefix_text, return_tensors="pt")["input_ids"].to(0)
inputs["input_ids"] = torch.cat([inputs["input_ids"], prefix_ids], dim=-1)
inputs["attention_mask"] = torch.ones_like(inputs["input_ids"])

# Generate exactly one token and keep its logits
output = model.generate(
    **inputs,
    max_new_tokens=1,
    output_logits=True,
    return_dict_in_generate=True,
)
last_logits = output.logits[-1][0]

# Map each rating word to its logit and compute the weighted average score in [0, 1]
logits_dict = {tok: last_logits[id_].item() for tok, id_ in zip(toks, ids_)}
weighted_score = wa5(logits_dict)
print("Weighted average score:", weighted_score)
For dataset evaluation scripts, please refer to this directory. For training details, see the Training Q-SiT section of the GitHub repository.
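Image quality scoring benchmarks are typically reported as SRCC/PLCC between predicted scores and human MOS labels. A minimal sketch with scipy (the two lists are placeholders for your own predictions and labels):

from scipy.stats import spearmanr, pearsonr

# Placeholder values: model predictions and ground-truth MOS for the same images.
predicted_scores = [0.71, 0.42, 0.88, 0.55]
mos_labels = [3.6, 2.1, 4.5, 2.9]

srcc, _ = spearmanr(predicted_scores, mos_labels)
plcc, _ = pearsonr(predicted_scores, mos_labels)
print(f"SRCC: {srcc:.4f}, PLCC: {plcc:.4f}")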
📄 License
This project is released under the MIT License.
📚 Citation
If you find our work useful, please cite our paper as follows:
@misc{zhang2025teachinglmmsimagequality,
  title={Teaching LMMs for Image Quality Scoring and Interpreting},
  author={Zicheng Zhang and Haoning Wu and Ziheng Jia and Weisi Lin and Guangtao Zhai},
  year={2025},
  eprint={2503.09197},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.09197},
}