🚀 LongVU
This repository contains the Qwen2-7B-based model from the paper LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. The model targets long video-language understanding and improves performance through spatiotemporal adaptive compression.
You can try the model in the Hugging Face demo.
📚 Detailed Documentation
Datasets and Base Model
| Property | Details |
|------|------|
| Datasets | shenxq/OneVision, shenxq/VideoChat2 |
| Base model | Vision-CAIR/LongVU_Qwen2_7B_img |
| Task type | video-text-to-text |
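Before running the usage example below, the model weights need to be available locally. The snippet below is a minimal sketch using `huggingface_hub.snapshot_download`, assuming the checkpoint is hosted on the Hugging Face Hub under `Vision-CAIR/LongVU_Qwen2_7B` (an assumption; substitute the actual repo id if it differs) and should land in the directory that the usage example loads from.

```python
from huggingface_hub import snapshot_download

# Download the model weights into the directory used by the usage example.
# The repo id below is an assumption; replace it with the actual repository.
snapshot_download(
    repo_id="Vision-CAIR/LongVU_Qwen2_7B",
    local_dir="./checkpoints/longvu_qwen",
)
```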
Model Evaluation Results
The model (evaluated under the name llava-onevision-qwen-7b-ov) was benchmarked on multimodal tasks across several datasets, with the following results:
| Dataset | Dataset Type | Accuracy |
|------|------|------|
| EgoSchema | egoschema | 67.6% |
| MLVU | mlvu | 65.4% |
| MVBench | mvbench | 66.9% |
| VideoMME | videomme | 60.6% |
License
This project is released under the Apache-2.0 license.
💻 Usage Examples
Basic Usage
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

# Load the LongVU checkpoint together with its tokenizer and image processor.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    "./checkpoints/longvu_qwen", None, "cambrian_qwen",
)
model.eval()

video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

# Decode the video and sample roughly one frame per second.
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps))])
video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()
    video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

# Build the conversation prompt with the image placeholder token.
qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

# Greedy decoding of the answer.
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=video,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0.2,
        max_new_tokens=128,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
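Sampling one frame per second means a long video can still yield a very large number of frames. The snippet below is a minimal sketch of capping the frame count before preprocessing; the `max_frames` value is an illustrative assumption, not a setting from the released code.

```python
import numpy as np

max_frames = 300  # illustrative cap (assumption); tune to fit GPU memory

# Start from the 1 fps sampling used above, then resample uniformly if needed.
frame_indices = np.arange(0, len(vr), round(fps))
if len(frame_indices) > max_frames:
    frame_indices = np.linspace(0, len(vr) - 1, max_frames, dtype=int)
```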
Advanced Usage
For more details, you can refer to the GitHub repository; a small variation on the basic example is sketched below.
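As one example, the generation arguments from the basic example can be adjusted to produce longer, sampled outputs. This is a minimal sketch that reuses the variables prepared in the basic example; the temperature, top_p, and token budget shown are illustrative values, not recommended settings.

```python
# Sampled decoding with a larger token budget (illustrative values).
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=video,
        image_sizes=image_sizes,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        max_new_tokens=512,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```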
📄 Citation
If you use the model or code from this project, please cite the following paper:
@article{shen2024longvu,
  title={LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding},
  author={Shen, Xiaoqian and Xiong, Yunyang and Zhao, Changsheng and Wu, Lemeng and Chen, Jun and Zhu, Chenchen and Liu, Zechun and Xiao, Fanyi and Varadarajan, Balakrishnan and Bordes, Florian and Liu, Zhuang and Xu, Hu and J. Kim, Hyunwoo and Soran, Bilge and Krishnamoorthi, Raghuraman and Elhoseiny, Mohamed and Chandra, Vikas},
  journal={arXiv:2410.17434},
  year={2024}
}