LLaVA-Video-7B-Qwen2: an open-source video understanding model

Llava Video 7B Qwen2

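Developed by lmms-lab

LLaVA-Video is a 7B-parameter multimodal model built on the Qwen2 language model. It focuses on video understanding and accepts up to 64 video frames as input.

Video-to-Text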

Transformers

Language: English | License: Apache-2.0 | #Video QA #Multimodal Interaction #Long Video Understanding

Downloads: 34.28k

Released: 9/2/2024

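Model Overview

The model is trained on the LLaVA-Video-178K and LLaVA-OneVision datasets. It can interact with images, multiple images, and videos, and is aimed primarily at video understanding tasks.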

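Model Features

Multimodal video understanding

Takes video input and generates text descriptions or answers questions about it

Long-context support

Supports a 32K-token context window, so it can handle longer video content

Multi-frame processing

Handles up to 64 frames of video input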

Model Capabilities

Video content understanding

Video question answering

Video description generation

Multimodal reasoning

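Use Cases

Video understanding

Video content description

Generates a detailed description of the input video

Video question answering

Answers a wide range of questions about video content

Reported accuracy on video QA benchmarks includes 83.2% on NextQA and 56.5% on ActNet-QA (see the evaluation table below)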

🚀 LLaVA-Video-7B-Qwen2

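LLaVA-Video-7B-Qwen2 is a multimodal model built on the Qwen2 language model. It can interact with images, multiple images, and videos, with a particular focus on video. It was trained and evaluated on several multimodal datasets; benchmark results are listed in the evaluation table below.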

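🚀 Quick Start

Model Summary

The LLaVA-Video models are 7B- and 72B-parameter models trained on the LLaVA-Video-178K and LLaVA-OneVision datasets. They are based on the Qwen2 language model and have a context window of 32K tokens.

This model supports at most 64 frames.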
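The 32K-token window and the 64-frame cap interact: every sampled frame consumes visual tokens, and the prompt and generated answer need room too. The sketch below shows that arithmetic only. The function name, the interpretation of "32K" as 32 × 1024, and especially the `tokens_per_frame` values are illustrative assumptions, not figures taken from the model card.

```python
def max_frames_within_context(context_tokens: int,
                              tokens_per_frame: int,
                              reserved_text_tokens: int,
                              frame_cap: int = 64) -> int:
    """Return how many frames fit in the context window, capped at frame_cap.

    Hypothetical helper for illustration; it is not part of the LLaVA-NeXT code base.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    # Tokens left for visual input after reserving room for prompt and answer.
    available = context_tokens - reserved_text_tokens
    if available <= 0:
        return 0
    return min(frame_cap, available // tokens_per_frame)


# Assumption: "32K" is read as 32 * 1024 tokens.
CONTEXT_TOKENS = 32 * 1024

# Both tokens_per_frame values below are made up to show the two regimes.
few_tokens = max_frames_within_context(CONTEXT_TOKENS, tokens_per_frame=200,
                                       reserved_text_tokens=4096)
many_tokens = max_frames_within_context(CONTEXT_TOKENS, tokens_per_frame=700,
                                        reserved_text_tokens=4096)
print(few_tokens, many_tokens)
```

With a small per-frame cost the 64-frame cap is the binding limit; with a large one the context window runs out first.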

Project page: Project Page
Paper: see our paper for more details
Repository: LLaVA-VL/LLaVA-NeXT
Point of contact: Yuanhan Zhang
Languages: English, Chinese

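Use

Intended use

The model is trained on the LLaVA-Video-178K and LLaVA-OneVision datasets. It can interact with images, multiple images, and videos, with a particular focus on video.

Feel free to share your generations in the Community tab!

Generation

We provide a simple generation process for using the model. For more details, refer to GitHub.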

# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
import copy
import warnings

import numpy as np
import torch
from decord import VideoReader, cpu

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token
from llava.model.builder import load_pretrained_model

warnings.filterwarnings("ignore")
def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    # Returns (frames, comma-separated frame timestamps, video duration in seconds).
    if max_frames_num == 0:
        # Placeholder frame; return a 3-tuple so callers can always unpack three values.
        return np.zeros((1, 336, 336, 3)), "", 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Step between sampled frames so that roughly `fps` frames are taken per second.
    stride = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, stride))
    # Timestamps in seconds: frame index divided by the video's native frame rate.
    frame_time = [i / avg_fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames (or sampling forced): pick max_frames_num evenly spaced frames.
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / avg_fps for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time
pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()
video_path = "XXXX"
max_frames_num = 64
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].to(device).bfloat16()  # match the bfloat16 weights loaded above
video = [video]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}.Please answer the following questions related to this video."
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities= ["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
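The frame selection and the time instruction in the script above can be checked without a video file or a GPU. The following is a minimal pure-Python sketch of that logic, assuming timestamps are meant to be in seconds. `sample_frame_indices` and `build_time_instruction` are hypothetical helper names introduced here, and the integer arithmetic stands in for `np.linspace(..., dtype=int)` rather than calling it.

```python
def sample_frame_indices(total_frames, avg_fps, max_frames,
                         target_fps=1.0, force_sample=False):
    """Return (frame indices, timestamps in seconds) for a video.

    Hypothetical helper: first take about `target_fps` frames per second,
    then fall back to `max_frames` evenly spaced frames if that is too many
    (or if force_sample is set).
    """
    if total_frames <= 0 or avg_fps <= 0 or max_frames <= 0:
        return [], []
    stride = max(1, round(avg_fps / target_fps))
    indices = list(range(0, total_frames, stride))
    if len(indices) > max_frames or force_sample:
        if max_frames == 1:
            indices = [0]
        else:
            last = total_frames - 1
            # Evenly spaced from the first to the last frame, rounded down.
            indices = [(i * last) // (max_frames - 1) for i in range(max_frames)]
    timestamps = [idx / avg_fps for idx in indices]
    return indices, timestamps


def build_time_instruction(video_seconds, timestamps):
    """Format the time instruction string used in the generation script."""
    frame_time = ",".join(f"{t:.2f}s" for t in timestamps)
    return (
        f"The video lasts for {video_seconds:.2f} seconds, and {len(timestamps)} "
        f"frames are uniformly sampled from it. These frames are located at "
        f"{frame_time}.Please answer the following questions related to this video."
    )


# A 10-second clip at 30 fps, forced down to 4 frames.
indices, timestamps = sample_frame_indices(300, 30.0, 4, force_sample=True)
print(indices)
print(build_time_instruction(300 / 30.0, timestamps))
```

Note that with `force_sample=True` and `max_frames` larger than the number of frames in the clip, some indices repeat, which matches the behavior of uniform sampling in the script.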

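🔧 Technical Details

Training

Model

Architecture: SO400M + Qwen2
Initialized from: lmms-lab/llava-onevision-qwen2-7b-si
Training data: a mixture of 1.6M single-image, multi-image, and video samples; 1 epoch; full-model training
Precision: bfloat16

Hardware & Software

GPUs: 256 × NVIDIA Tesla A100 (used to train the whole model series)
Orchestration: Huggingface Trainer
Neural network framework: PyTorch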

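Evaluation Metrics

Dataset	Task type	Metric	Value
ActNet-QA	multimodal	accuracy (%)	56.5
EgoSchema	multimodal	accuracy (%)	57.3
MLVU	multimodal	accuracy (%)	70.8
MVBench	multimodal	accuracy (%)	58.6
NextQA	multimodal	accuracy (%)	83.2
PercepTest	multimodal	accuracy (%)	67.9
VideoChatGPT	multimodal	score	3.52
VideoDC	multimodal	score	3.66
LongVideoBench	multimodal	accuracy (%)	58.2
VideoMME	multimodal	accuracy (%)	63.3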

📄 License

This project is released under the Apache-2.0 license.

📖 Citation

@misc{zhang2024videoinstructiontuningsynthetic,
    title={Video Instruction Tuning With Synthetic Data}, 
    author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
    year={2024},
    eprint={2410.02713},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.02713}, 
}