🚀 LongVU
This repository contains the Qwen2-7B-based model from the paper LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. The model is designed for long video-language understanding and improves performance through spatiotemporal adaptive compression.
You can try the model in the HF demo.
📚 Documentation
Datasets and Base Model
| Attribute | Details |
| --- | --- |
| Datasets | shenxq/OneVision, shenxq/VideoChat2 |
| Base model | Vision-CAIR/LongVU_Qwen2_7B_img |
| Task type | video-text-to-text |
Model Evaluation Results
The model, listed as llava-onevision-qwen-7b-ov, was evaluated on multimodal tasks across several datasets, with the following results:
| Dataset | Dataset Type | Accuracy |
| --- | --- | --- |
| EgoSchema | egoschema | 67.6% |
| MLVU | mlvu | 65.4% |
| MVBench | mvbench | 66.9% |
| VideoMME | videomme | 60.6% |
License
This project is released under the Apache-2.0 license.
💻 Usage Examples
Basic Usage
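The example below loads the checkpoint from `./checkpoints/longvu_qwen`. If the weights are hosted on the Hugging Face Hub, you could fetch them first with `huggingface_hub`; this is a minimal sketch, and the repo id `Vision-CAIR/LongVU_Qwen2_7B` is an assumption inferred from the base-model name:

```python
from huggingface_hub import snapshot_download

# Assumed repo id; verify the actual Hub repository before use.
snapshot_download(
    repo_id="Vision-CAIR/LongVU_Qwen2_7B",
    local_dir="./checkpoints/longvu_qwen",  # path expected by the example below
)
```

With the checkpoint in place, the following script samples one frame per second from a video and asks the model to describe it: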
```python
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

tokenizer, model, image_processor, context_len = load_pretrained_model(
    "./checkpoints/longvu_qwen", None, "cambrian_qwen",
)
model.eval()

video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

# Sample one frame per second of video
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps))])
video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()
    video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

# Prepend the image token and build the chat prompt
qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=video,
        image_sizes=image_sizes,
        do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
        temperature=0.2,
        max_new_tokens=128,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```
Advanced Usage
For more details, refer to the GitHub repository.
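For very long videos, 1-fps sampling can produce thousands of frames. As an illustrative variation (not taken from the repository), the sketch below falls back to uniform subsampling once a frame budget is exceeded; the helper name and the `max_frames` default are assumptions:

```python
import numpy as np
from decord import cpu, VideoReader

def sample_frame_indices(video_path: str, max_frames: int = 1000) -> np.ndarray:
    """Hypothetical helper: 1-fps sampling with a uniform fallback cap."""
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    fps = float(vr.get_avg_fps())
    indices = np.arange(0, len(vr), round(fps))  # one frame per second
    if len(indices) > max_frames:
        # Subsample uniformly so at most max_frames frames are decoded
        indices = indices[np.linspace(0, len(indices) - 1, max_frames).astype(int)]
    return indices
```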
📄 Citation
If you use the model or code from this project, please cite the following paper:
```bibtex
@article{shen2024longvu,
  title={LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding},
  author={Shen, Xiaoqian and Xiong, Yunyang and Zhao, Changsheng and Wu, Lemeng and Chen, Jun and Zhu, Chenchen and Liu, Zechun and Xiao, Fanyi and Varadarajan, Balakrishnan and Bordes, Florian and Liu, Zhuang and Xu, Hu and J. Kim, Hyunwoo and Soran, Bilge and Krishnamoorthi, Raghuraman and Elhoseiny, Mohamed and Chandra, Vikas},
  journal={arXiv:2410.17434},
  year={2024}
}
```