LLaVA-Video-7B-Qwen2: Open-Source Video Understanding Model

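Developed by lmms-lab

LLaVA-Video is a 7B-parameter multimodal model built on the Qwen2 language model. It focuses on video understanding and accepts up to 64 video frames as input.

Video-to-Text

Transformers

Language: English
License: Apache-2.0
Tags: #VideoQuestionAnswering #MultimodalInteraction #LongVideoUnderstanding

Downloads: 34.28k

Released: 9/2/2024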

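Model Overview

The model is trained on the LLaVA-Video-178K and LLaVA-OneVision datasets. It can interact with images, multiple images, and videos, with a primary focus on video understanding.

Model Features

Multimodal video understanding

Processes video input and generates text descriptions or answers questions about it

Long-context support

32K-token context window for handling longer video content

Multi-frame processing

Accepts up to 64 video frames as input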
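
Because the model takes at most 64 frames, longer videos have to be reduced to a subset of frames before inference. The sketch below shows one way to choose evenly spaced frame indices in plain Python. The function name `uniform_frame_indices` is hypothetical and not part of the LLaVA-NeXT codebase, and keeping every frame of a short clip is an assumption of this sketch rather than documented model behavior.

```python
def uniform_frame_indices(total_frames: int, max_frames: int = 64) -> list[int]:
    """Pick up to `max_frames` evenly spaced frame indices from a video.

    Hypothetical helper for illustration; 64 is the frame limit stated
    for this model.
    """
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame (an assumption of this sketch).
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    # Evenly spaced positions from the first frame to the last, inclusive.
    last = total_frames - 1
    return [i * last // (max_frames - 1) for i in range(max_frames)]


indices = uniform_frame_indices(total_frames=1000)
print(len(indices), indices[0], indices[-1])
```

Integer arithmetic keeps the first and last indices exact, so the sampled frames always span the whole video.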

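Model Capabilities

Video content understanding

Video question answering

Video description generation

Multimodal reasoning

Use Cases

Video understanding

Video content description

Generates a detailed description of the input video

Video question answering

Answers a wide range of questions about video content

Performs strongly on several video question-answering benchmarks (for example, an accuracy of 83.2 on NextQA)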

🚀 LLaVA-Video-7B-Qwen2

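LLaVA-Video-7B-Qwen2 is a multimodal model built on the Qwen2 language model. It handles image, multi-image, and video interaction, with a particular focus on video. It was trained and evaluated on multiple multimodal datasets; the accuracy figures are listed in the evaluation table below.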

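🚀 Quick Start

Model Overview

The LLaVA-Video models come in 7B and 72B parameter sizes. They are trained on the LLaVA-Video-178K and LLaVA-OneVision datasets and are built on the Qwen2 language model, with a context window of 32K tokens.

This model supports at most 64 frames.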
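
The context window and the frame limit together bound how much video fits into a single prompt. The following back-of-the-envelope sketch is not taken from the model card: the helper `max_frames_in_context` is hypothetical, "32K" is read as 32,768 tokens, and the tokens-per-frame and reserved-text values in the usage lines are illustrative placeholders rather than properties of this model.

```python
def max_frames_in_context(
    tokens_per_frame: int,
    reserved_text_tokens: int,
    context_tokens: int = 32 * 1024,  # "32K" read as 32,768 tokens (assumption)
    frame_cap: int = 64,              # frame limit stated for this model
) -> int:
    """Estimate how many frames fit alongside the text part of a prompt."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    # Tokens left for visual input after setting aside room for text.
    available = max(context_tokens - reserved_text_tokens, 0)
    return min(available // tokens_per_frame, frame_cap)


# Illustrative numbers only; neither value comes from the model card.
print(max_frames_in_context(tokens_per_frame=200, reserved_text_tokens=1000))
print(max_frames_in_context(tokens_per_frame=1000, reserved_text_tokens=1000))
```

With a small per-frame cost the 64-frame cap is the binding limit; with a large one the context window becomes the constraint.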

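Project page: see the project page
Paper: see our paper for more details
Repository: LLaVA-VL/LLaVA-NeXT
Point of contact: Yuanhan Zhang
Supported languages: English, Chinese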

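Model Usage

Intended Use

The model is trained on the LLaVA-Video-178K and LLaVA-OneVision datasets. It can interact with images, multiple images, and videos, with a particular focus on video.

Feel free to share your generations in the Community tab!

Generation Example

We provide a simple generation pipeline for using this model. For more details, see the GitHub repository.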

# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
import copy
import torch
import warnings
from decord import VideoReader, cpu
import numpy as np
# Silence library warnings to keep the example output readable.
warnings.filterwarnings("ignore")
def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Return (frames, comma-separated frame timestamps, video duration in seconds)."""
    if max_frames_num == 0:
        # No frames requested: return one blank frame in the same 3-tuple shape the caller unpacks.
        return np.zeros((1, 336, 336, 3)), "", 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Frame step that approximates the requested sampling rate (never below 1).
    step = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, step))
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames, or sampling is forced: take max_frames_num evenly spaced frames.
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
    # Timestamp of each sampled frame, in seconds.
    frame_time = ",".join(f"{i / avg_fps:.2f}s" for i in frame_idx)
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time
pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()
video_path = "XXXX"
max_frames_num = 64
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
# Use the same bfloat16 dtype the model was loaded with.
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].to(device, dtype=torch.bfloat16)
video = [video]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities= ["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
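
The prompt in the example above tells the model how long the video is and where the sampled frames fall in time. To make that step easier to reuse and check in isolation, it can be factored into a small pure function. The sketch below is illustrative: `build_time_instruction` is a hypothetical helper, not part of the LLaVA-NeXT API, and it takes timestamps as floats in seconds instead of the pre-joined string used in the example.

```python
def build_time_instruction(video_time: float, frame_times: list[float]) -> str:
    """Build the timing preamble placed before the user's question.

    Hypothetical helper; the wording mirrors the example prompt above.
    """
    # Format each timestamp as seconds with two decimals, e.g. "5.00s".
    stamps = ",".join(f"{t:.2f}s" for t in frame_times)
    return (
        f"The video lasts for {video_time:.2f} seconds, and {len(frame_times)} "
        f"frames are uniformly sampled from it. These frames are located at {stamps}. "
        "Please answer the following questions related to this video."
    )


print(build_time_instruction(10.0, [0.0, 5.0, 10.0]))
```

Keeping the string construction separate from model loading means the prompt text can be verified without a GPU or a video file.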

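🔧 Technical Details

Training Information

Model

Architecture: SO400M + Qwen2
Initialized from: lmms-lab/llava-onevision-qwen2-7b-si
Training data: a mixture of 1.6M single-image, multi-image, and video samples; 1 epoch; full-model training
Precision: bfloat16

Hardware and Software

GPUs: 256 NVIDIA Tesla A100 (used to train the whole model series)
Orchestration: Huggingface Trainer
Neural network framework: PyTorch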

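Evaluation Metrics

Dataset	Task type	Metric	Value
ActNet-QA	multimodal	accuracy	56.5
EgoSchema	multimodal	accuracy	57.3
MLVU	multimodal	accuracy	70.8
MVBench	multimodal	accuracy	58.6
NextQA	multimodal	accuracy	83.2
PercepTest	multimodal	accuracy	67.9
VideoChatGPT	multimodal	score	3.52
VideoDC	multimodal	score	3.66
LongVideoBench	multimodal	accuracy	58.2
VideoMME	multimodal	accuracy	63.3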

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

@misc{zhang2024videoinstructiontuningsynthetic,
    title={Video Instruction Tuning With Synthetic Data}, 
    author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
    year={2024},
    eprint={2410.02713},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.02713}, 
}