VideoScore-v1.1: Open-Source Video Quality Assessment Model with 48-Frame Inference and Improved Text-to-Video Alignment Scoring

VideoScore-v1.1

Developed by TIGER-Lab

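VideoScore-v1.1 is a video quality assessment model built on Mantis-8B-Idefics2. It supports 48-frame inference and performs especially well on the text-to-video alignment sub-score.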

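Task: Text-to-Video

Library: Transformers

Language: English

License: MIT

Tags: #video-quality-assessment #multi-dimensional-scoring #text-to-video-alignment

Downloads: 703

Released: 11/28/2024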

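Model Overview

The VideoScore series consists of video quality assessment models that rate AI-generated videos along several dimensions: visual quality, temporal consistency, dynamic degree, text-to-video alignment, and factual consistency.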

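Model Features

Multi-dimensional evaluation

Scores each video on five dimensions: visual quality, temporal consistency, dynamic degree, text-to-video alignment, and factual consistency.

Higher frame count

Handles up to 48 frames per video at inference, up from the 16 frames used with the original VideoScore in the usage example.

Strong performance

Reaches a Spearman correlation of 74.0 on VideoFeedback-test, outperforming baselines such as GPT-4o (23.1).

Regression model

Outputs a continuous score per dimension on a 1.0-4.0 scale rather than a class label.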

Model Capabilities

Video quality assessment

Multi-dimensional scoring

Text-to-video alignment analysis

Factual consistency checking

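Use Cases

AI-generated video evaluation

Evaluating video generation models

Score the quality of AI-generated videos to give feedback to video generation models.

Agrees closely with human ratings, with a Spearman correlation of 74.0 on VideoFeedback-test.

Video content review

Check whether a generated video is consistent with facts and common sense.

The factual consistency dimension provides a dedicated score for this check.

Video quality research

Video quality benchmarking

Provide a standardized evaluation tool for video quality research.

Outperforms the best baselines on GenAI-Bench and VBench.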
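
As a rough illustration of the "feedback for video generation models" use case above, the sketch below averages per-video scores into one mean per dimension for a single generator. The function name `mean_scores_per_dimension` and the toy numbers are hypothetical and are not part of the VideoScore package; the input is assumed to be one five-element score list per video, in the dimension order listed above.

```python
from statistics import mean
from typing import List


def mean_scores_per_dimension(per_video_scores: List[List[float]]) -> List[float]:
    """Average scores column-wise: one mean per dimension across all videos.

    Hypothetical helper; each inner list holds the scores for one video.
    """
    # zip(*rows) transposes the rows so each tuple holds one dimension's scores.
    return [round(mean(column), 3) for column in zip(*per_video_scores)]


# Made-up scores for three videos from one generator (illustration only).
toy_scores = [
    [2.0, 2.5, 3.0, 2.0, 2.5],
    [3.0, 2.5, 2.0, 3.0, 2.5],
    [2.5, 2.5, 2.5, 2.5, 2.5],
]
summary = mean_scores_per_dimension(toy_scores)
print(summary)
```

Comparing such per-dimension means between two generators on the same prompt set is one simple way to see where each one is weaker.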

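🚀 VideoScore-v1.1 Video Quality Assessment Model

VideoScore-v1.1 is a video quality assessment model that uses Mantis-8B-Idefics2 as its base model and is trained on VideoFeedback, a large-scale video evaluation dataset. It scores video quality along multiple dimensions, agrees closely with human ratings, and performs well on several benchmarks.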

🚀 Quick Start

You can learn about and start using VideoScore-v1.1 through the following link:

VideoScore

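✨ Key Features

What's new: VideoScore-v1.1 is a variant of VideoScore that performs better on the "text-to-video alignment" sub-score and now supports 48 frames at inference. It uses Mantis-8B-Idefics2 as its base model and is trained on the VideoFeedback dataset.
Model series: The VideoScore series is a family of video quality assessment models that use Mantis-8B-Idefics2 or Qwen/Qwen2-VL as the base model and are trained on VideoFeedback, a large video evaluation dataset with multi-aspect human ratings.
Evaluation performance: Like VideoScore, VideoScore-v1.1 reaches a Spearman correlation with human ratings of roughly 75 on the VideoFeedback test set (the results table in this document reports 74.0), surpassing all multimodal large language model (MLLM) prompting methods and feature-based metrics. VideoScore-v1.1 also beats the best baselines on two other benchmarks, GenAI-Bench and VBench, which indicates close agreement with human evaluation. For details on the benchmark data, see VideoScore-Bench.
Model type: VideoScore-v1.1 is a regression model.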
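
To make the 48-frame limit mentioned above concrete, here is a minimal sketch of picking evenly spaced frame indices when a video has more frames than the limit. `sample_frame_indices` is a hypothetical, pure-Python stand-in for the NumPy expression in the usage example; it is not part of the VideoScore or Mantis API, and the default of 48 is taken from this document.

```python
from typing import List


def sample_frame_indices(total_frames: int, max_frames: int = 48) -> List[int]:
    """Return at most max_frames evenly spaced frame indices (hypothetical helper)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short video: keep every frame.
        return list(range(total_frames))
    # Long video: step through it at a fractional stride and truncate to ints.
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]


indices = sample_frame_indices(96)
print(len(indices), indices[:4], indices[-1])
```

Because the stride is greater than 1 whenever sampling is needed, the indices are strictly increasing and never repeat a frame.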

📦 Installation

You can install VideoScore with the following command:

pip install git+https://github.com/TIGER-AI-Lab/VideoScore.git
# or
# pip install mantis-vl

💻 Usage Examples

Basic Usage

The following example runs inference with VideoScore-v1.1:

import av
import numpy as np
from typing import List
from PIL import Image
import torch
from transformers import AutoProcessor
from mantis.models.idefics2 import Idefics2ForSequenceClassification


def _read_video_pyav(container, indices):
    """Decode the frames at the given indices from an open PyAV container."""
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

ROUND_DIGIT=3
REGRESSION_QUERY_PROMPT = """
Suppose you are an expert in judging and evaluating the quality of AI-generated videos,
please watch the following frames of a given video and see the text prompt for generating the video,
then give scores from 5 different dimensions:
(1) visual quality: the quality of the video in terms of clearness, resolution, brightness, and color
(2) temporal consistency, both the consistency of objects or humans and the smoothness of motion or movements
(3) dynamic degree, the degree of dynamic changes
(4) text-to-video alignment, the alignment between the text prompt and the video content
(5) factual consistency, the consistency of the video content with the common-sense and factual knowledge
for each dimension, output a float number from 1.0 to 4.0,
the higher the number is, the better the video performs in that sub-score, 
the lowest 1.0 means Bad, the highest 4.0 means Perfect/Real (the video is like a real video)
Here is an output example:
visual quality: 3.2
temporal consistency: 2.7
dynamic degree: 4.0
text-to-video alignment: 2.3
factual consistency: 1.8
For this video, the text prompt is "{text_prompt}",
all the frames of video are as follows:
"""

# MAX_NUM_FRAMES=16
# model_name="TIGER-Lab/VideoScore"

# =======================================
# we support 48 frames in VideoScore-v1.1
# =======================================
MAX_NUM_FRAMES=48
model_name="TIGER-Lab/VideoScore-v1.1"

video_path="video1.mp4"
video_prompt="Near the Elephant Gate village, they approach the haunted house at night. Rajiv feels anxious, but Bhavesh encourages him. As they reach the house, a mysterious sound in the air adds to the suspense."

processor = AutoProcessor.from_pretrained(model_name,torch_dtype=torch.bfloat16)
model = Idefics2ForSequenceClassification.from_pretrained(model_name,torch_dtype=torch.bfloat16).eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# uniformly sample up to MAX_NUM_FRAMES frames from the video
container = av.open(video_path)
total_frames = container.streams.video[0].frames
if total_frames > MAX_NUM_FRAMES:
    indices = np.arange(0, total_frames, total_frames / MAX_NUM_FRAMES).astype(int)
else:
    indices = np.arange(total_frames)

frames = [Image.fromarray(x) for x in _read_video_pyav(container, indices)]
eval_prompt = REGRESSION_QUERY_PROMPT.format(text_prompt=video_prompt)
num_image_token = eval_prompt.count("<image>")
if num_image_token < len(frames):
    eval_prompt += "<image> " * (len(frames) - num_image_token)
flatten_images = []
for x in [frames]:
    if isinstance(x, list):
        flatten_images.extend(x)
    else:
        flatten_images.append(x)

flatten_images = [Image.open(x) if isinstance(x, str) else x for x in flatten_images]
inputs = processor(text=eval_prompt, images=flatten_images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
num_aspects = logits.shape[-1]
aspect_scores = []
for i in range(num_aspects):
    aspect_scores.append(round(logits[0, i].item(),ROUND_DIGIT))

print(aspect_scores)
"""
model output on visual quality, temporal consistency, dynamic degree,
text-to-video alignment, factual consistency, respectively
VideoScore: 
[2.297, 2.469, 2.906, 2.766, 2.516]

VideoScore-v1.1:
[2.328, 2.484, 2.562, 1.969, 2.594]
"""

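The example prints a bare list, so it can help to pair each value with its dimension name. The sketch below does that using the order given in the output comment of the example above; `ASPECT_NAMES` and `label_scores` are hypothetical names introduced here for illustration and are not part of the VideoScore package.

```python
from typing import Dict, List

# Dimension order as stated in the example's output comment.
ASPECT_NAMES = [
    "visual quality",
    "temporal consistency",
    "dynamic degree",
    "text-to-video alignment",
    "factual consistency",
]


def label_scores(aspect_scores: List[float]) -> Dict[str, float]:
    """Map a list of per-dimension scores to {dimension name: score}."""
    if len(aspect_scores) != len(ASPECT_NAMES):
        raise ValueError(
            f"expected {len(ASPECT_NAMES)} scores, got {len(aspect_scores)}"
        )
    return dict(zip(ASPECT_NAMES, aspect_scores))


# VideoScore-v1.1 output copied from the example above.
labeled = label_scores([2.328, 2.484, 2.562, 1.969, 2.594])
for name, score in labeled.items():
    print(f"{name}: {score}")
```
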
Training

For training details, see VideoScore/training.

Evaluation

For evaluation details, see VideoScore/benchmark.

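📚 Detailed Documentation

Evaluation Results

We evaluated VideoScore-v1.1 on the VideoFeedback test set. The metric is the Spearman correlation between model outputs and human ratings, averaged over all evaluation aspects. The values appear to be reported on a ×100 scale (for example, 74.0 corresponds to ρ ≈ 0.74). Results:

Method	VideoFeedback-test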
VideoScore-v1.1	74.0
Gemini-1.5-Pro	22.1
Gemini-1.5-Flash	20.8
GPT-4o	23.1
CLIP-sim	8.9
DINO-sim	7.5
SSIM-sim	13.4
CLIP-Score	-7.2
LLaVA-1.5-7B	8.5
LLaVA-1.6-7B	-3.1
X-CLIP-Score	-1.9
PIQE	-10.1
BRISQUE	-20.3
Idefics2	6.5
MSE-dyn	-5.5
SSIM-dyn	-12.9

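In the original table, the best result in the VideoScore series is shown in bold and the best baseline result is underlined; here those are VideoScore-v1.1 (74.0) and GPT-4o (23.1).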

📄 License

This project is released under the MIT License.

📖 Citation

If you use this model or the related code, please cite the following paper:

@article{he2024videoscore,
  title = {VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation},
  author = {He, Xuan and Jiang, Dongfu and Zhang, Ge and Ku, Max and Soni, Achint and Siu, Sherman and Chen, Haonan and Chandra, Abhranil and Jiang, Ziyan and Arulraj, Aaran and Wang, Kai and Do, Quy Duc and Ni, Yuansheng and Lyu, Bohan and Narsupalli, Yaswanth and Fan, Rongqi and Lyu, Zhiheng and Lin, Yuchen and Chen, Wenhu},
  journal = {ArXiv},
  year = {2024},
  volume={abs/2406.15252},
  url = {https://arxiv.org/abs/2406.15252},
}