# 🚀 LongVU
This repository hosts a model based on Qwen2-7B, introduced in [LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding](https://arxiv.org/abs/2410.17434). It targets long video-language understanding tasks. You can interact with the model on the [HF demo](https://huggingface.co/spaces/Vision-CAIR/LongVU).
## 📦 Installation
For environment setup and dependencies, follow the instructions in the [GitHub repository](https://github.com/Vision-CAIR/LongVU).
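The usage example below loads weights from `./checkpoints/longvu_qwen`. A minimal sketch of fetching the checkpoint from the Hugging Face Hub with `huggingface_hub` is shown here; the `repo_id` is an assumption based on this model card and may need to be adjusted to the actual repository name.

```python
# Sketch: download the checkpoint into the path expected by the usage example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Vision-CAIR/LongVU_Qwen2_7B",  # assumed repository name
    local_dir="./checkpoints/longvu_qwen",
)
```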
## 💻 Usage Examples

### Basic Usage
```python
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

# Load the tokenizer, model, and image processor from the local checkpoint.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    "./checkpoints/longvu_qwen", None, "cambrian_qwen",
)
model.eval()

video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

# Sample roughly one frame per second from the video.
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps))])
video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()
    video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

# Build the prompt with the image token and the Qwen conversation template.
qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

# Greedy decoding (temperature has no effect when do_sample=False).
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=video,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0.2,
        max_new_tokens=128,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```
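The decoded answer is returned in `pred` (e.g. `print(pred)`). For very long videos, sampling at roughly one frame per second can yield thousands of frames; a small, hypothetical adjustment that caps the number of sampled frames is sketched below. It would go right after `frame_indices` is computed in the snippet above, and the `max_frames` budget is an assumed value, not part of the original example.

```python
# Hypothetical tweak: cap the number of sampled frames for very long videos.
max_frames = 1000  # assumed frame budget
if len(frame_indices) > max_frames:
    frame_indices = np.linspace(0, len(vr) - 1, max_frames, dtype=int)
```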
## 📚 Documentation
The snippet above demonstrates a simple generation pipeline for the model. For more details, refer to the [GitHub repository](https://github.com/Vision-CAIR/LongVU).
## 📄 License
The model is released under the Apache-2.0 license.
## 📋 Model Information

| Property | Details |
|----------|---------|
| Datasets | shenxq/OneVision, shenxq/VideoChat2 |
| Base Model | Vision-CAIR/LongVU_Qwen2_7B_img |
| Pipeline Tag | video-text-to-text |
## 📊 Model Results

| Task Type | Dataset Name | Accuracy |
|-----------|--------------|----------|
| Multimodal | EgoSchema | 67.6 |
| Multimodal | MLVU | 65.4 |
| Multimodal | MVBench | 66.9 |
| Multimodal | VideoMME | 60.6 |
## 📖 Citation

```bibtex
@article{shen2024longvu,
  title={LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding},
  author={Shen, Xiaoqian and Xiong, Yunyang and Zhao, Changsheng and Wu, Lemeng and Chen, Jun and Zhu, Chenchen and Liu, Zechun and Xiao, Fanyi and Varadarajan, Balakrishnan and Bordes, Florian and Liu, Zhuang and Xu, Hu and Kim, Hyunwoo J. and Soran, Bilge and Krishnamoorthi, Raghuraman and Elhoseiny, Mohamed and Chandra, Vikas},
  journal={arXiv:2410.17434},
  year={2024}
}
```