# 🚀 LongVU
This repository hosts a model based on Qwen2-7B, introduced in [LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding](https://arxiv.org/abs/2410.17434). It targets long video-language understanding tasks. You can interact with the model on the [HF demo](https://huggingface.co/spaces/Vision-CAIR/LongVU).
## 📦 Installation
For environment setup and dependencies, follow the instructions in the [GitHub repository](https://github.com/Vision-CAIR/LongVU).
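The usage example below loads weights from `./checkpoints/longvu_qwen`. A minimal sketch of fetching the checkpoint from the Hugging Face Hub with `huggingface_hub` is shown here; the `repo_id` is an assumption based on this model card and may need to be adjusted to the actual repository name.

```python
# Sketch: download the checkpoint into the path expected by the usage example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Vision-CAIR/LongVU_Qwen2_7B",  # assumed repository name
    local_dir="./checkpoints/longvu_qwen",
)
```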
## 💻 Usage Examples

### Basic Usage
```python
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

# Load the tokenizer, model, and image processor from the local checkpoint.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    "./checkpoints/longvu_qwen", None, "cambrian_qwen",
)
model.eval()

video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

# Sample roughly one frame per second from the video.
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps))])
video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()
    video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

# Build the prompt with the image token and the Qwen conversation template.
qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

# Greedy decoding (temperature has no effect when do_sample=False).
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=video,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0.2,
        max_new_tokens=128,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```
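The decoded answer is returned in `pred` (e.g. `print(pred)`). For very long videos, sampling at roughly one frame per second can yield thousands of frames; a small, hypothetical adjustment that caps the number of sampled frames is sketched below. It would go right after `frame_indices` is computed in the snippet above, and the `max_frames` budget is an assumed value, not part of the original example.

```python
# Hypothetical tweak: cap the number of sampled frames for very long videos.
max_frames = 1000  # assumed frame budget
if len(frame_indices) > max_frames:
    frame_indices = np.linspace(0, len(vr) - 1, max_frames, dtype=int)
```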
## 📚 Documentation
The snippet above demonstrates a simple generation pipeline for the model. For more details, refer to the [GitHub repository](https://github.com/Vision-CAIR/LongVU).
## 📄 License
The model is released under the Apache-2.0 license.
## 📋 Model Information

| Property | Details |
|----------|---------|
| Datasets | shenxq/OneVision, shenxq/VideoChat2 |
| Base Model | Vision-CAIR/LongVU_Qwen2_7B_img |
| Pipeline Tag | video-text-to-text |
## 📊 Model Results

| Task Type | Dataset Name | Accuracy |
|-----------|--------------|----------|
| Multimodal | EgoSchema | 67.6 |
| Multimodal | MLVU | 65.4 |
| Multimodal | MVBench | 66.9 |
| Multimodal | VideoMME | 60.6 |
## 📖 Citation

```bibtex
@article{shen2024longvu,
  title={LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding},
  author={Shen, Xiaoqian and Xiong, Yunyang and Zhao, Changsheng and Wu, Lemeng and Chen, Jun and Zhu, Chenchen and Liu, Zechun and Xiao, Fanyi and Varadarajan, Balakrishnan and Bordes, Florian and Liu, Zhuang and Xu, Hu and Kim, Hyunwoo J. and Soran, Bilge and Krishnamoorthi, Raghuraman and Elhoseiny, Mohamed and Chandra, Vikas},
  journal={arXiv:2410.17434},
  year={2024}
}
```