LLaVA-Video-7B-Qwen2-TPO Open-Source Video Understanding Model - Optimized for Temporal Preferences and Performs Well in Benchmark Tests


Developed by ruili0

LLaVA-Video-7B-Qwen2-TPO is a video understanding model built on LLaVA-Video-7B-Qwen2 and further trained with temporal preference optimization (TPO). It reports strong results on LongVideoBench, MLVU, and Video-MME.

Video-to-Text

Transformers

Open Source License: MIT

Tags: #Long Video Understanding #Temporal Preference Optimization #Multimodal Video Analysis

Downloads 490

Release Time: 1/16/2025

Model Overview

This model improves long-video understanding through temporal preference optimization and is the leading 7B-parameter model on the Video-MME benchmark.

Model Features

Temporal Preference Optimization

Significantly improves long video understanding through temporal preference optimization technology

High Performance

Demonstrates excellent performance in benchmarks such as LongVideoBench, MLVU, and VideoMME

Efficient Parameter Utilization

As a 7B parameter model, it matches or surpasses the performance of larger-scale models

Model Capabilities

Long video content understanding

Video content description generation

Multimodal video analysis

Use Cases

Video Content Analysis

Video Content Description

Provides detailed descriptions of video content

Generates accurate and comprehensive video content descriptions

Education

Educational Video Analysis

Analyzes educational video content and generates summaries

Helps students quickly grasp key points of the video

🚀 LLaVA-Video-7B-Qwen2-TPO

LLaVA-Video-7B-Qwen2-TPO, obtained by applying temporal preference optimization to LLaVA-Video-7B-Qwen2, achieves state-of-the-art performance on video understanding benchmarks.

LLaVA-Video-7B-Qwen2-TPO was introduced in the paper "Temporal Preference Optimization for Long-Form Video Understanding". It is obtained by applying temporal preference optimization to LLaVA-Video-7B-Qwen2. The model establishes state-of-the-art performance across a range of benchmarks, with an average improvement of 1.5% over LLaVA-Video-7B. Notably, it is the leading 7B-parameter model on the Video-MME benchmark.

Project page: https://ruili33.github.io/tpo_website/

Code: https://github.com/ruili33/TPO

✨ Features

Optimized by temporal preference based on LLaVA-Video-7B-Qwen2.
Achieves state-of-the-art performance across multiple benchmarks.
Shows an average performance improvement of 1.5% compared to LLaVA-Video-7B.
Leads among 7B parameter models on the Video-MME benchmark.

📊 Evaluation Results

Model	Size	LongVideoBench	MLVU	VideoMME (Average)
NVILA [1]	7B	57.7	70.1	64.2/70.0
LLaVA-Video-7B [2]	7B	58.2	70.8	63.3/69.7
LLaVA-Video-7B-Qwen2-TPO	7B	60.1	71.1	65.6/71.5
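
The gaps between the TPO model and its LLaVA-Video-7B baseline can be recomputed directly from the table. The snippet below is a minimal sketch that does only this arithmetic; the dictionary layout and the function name `score_deltas` are illustrative, and the two VideoMME numbers are kept as separate, neutrally labeled entries because the table does not label them individually.

```python
# Scores copied from the evaluation table above.
# The "VideoMME (1st value)" / "(2nd value)" labels are placeholders for the
# two numbers given in the "VideoMME (Average)" column.
SCORES = {
    "LLaVA-Video-7B": {
        "LongVideoBench": 58.2,
        "MLVU": 70.8,
        "VideoMME (1st value)": 63.3,
        "VideoMME (2nd value)": 69.7,
    },
    "LLaVA-Video-7B-Qwen2-TPO": {
        "LongVideoBench": 60.1,
        "MLVU": 71.1,
        "VideoMME (1st value)": 65.6,
        "VideoMME (2nd value)": 71.5,
    },
}


def score_deltas(baseline, improved):
    """Return improved - baseline per benchmark, rounded to one decimal."""
    return {name: round(improved[name] - baseline[name], 1) for name in baseline}


deltas = score_deltas(SCORES["LLaVA-Video-7B"], SCORES["LLaVA-Video-7B-Qwen2-TPO"])
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

Each printed value is a difference in benchmark points between the two rows of the table.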

🚀 Quick Start

Use the code below to get started with the model. For more information, see the GitHub repository at https://github.com/ruili33/TPO.

💻 Usage Examples

Basic Usage

# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
warnings.filterwarnings("ignore")
def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Return (frames, frame_time, video_time) for the video at video_path."""
    if max_frames_num == 0:
        # No frames requested: return one blank frame plus empty timing info
        # so callers can still unpack three values.
        return np.zeros((1, 336, 336, 3)), "", 0.0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    avg_fps = vr.get_avg_fps()
    video_time = total_frame_num / avg_fps
    # Step between kept frames so that roughly `fps` frames are kept per second.
    step = max(1, round(avg_fps / fps))
    frame_idx = list(range(0, total_frame_num, step))
    frame_time = [i / avg_fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames (or sampling forced): take max_frames_num evenly spaced frames.
        frame_idx = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int).tolist()
        frame_time = [i / avg_fps for i in frame_idx]
    # Timestamps as a comma-separated string, e.g. "0.00s,1.00s,2.00s".
    frame_time = ",".join([f"{t:.2f}s" for t in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time
pretrained = "ruili0/LLaVA-Video-7B-TPO"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()
video_path = "local_demo/assets/dc_demo.mp4"
max_frames_num = 64  # must be an int: it is used as the sample count in np.linspace
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video = [video]
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}.Please answer the following questions related to this video."
question = DEFAULT_IMAGE_TOKEN + f"{time_instruction}\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities= ["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
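
The frame sampling and timestamp formatting inside `load_video` can be tried without a video file or a GPU. The following is a minimal, dependency-free sketch of that logic, not the code the model ships with: `uniform_frame_indices` and `format_timestamps` are hypothetical helper names, and the pure-Python index formula only approximates `np.linspace(..., dtype=int)` (it may differ by one index in rare floating-point edge cases).

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly over [0, total_frames - 1]."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Evenly spaced positions between 0 and last, truncated to integers,
    # mirroring the idea of np.linspace(0, last, num_samples, dtype=int).
    return [int(i * last / (num_samples - 1)) for i in range(num_samples)]


def format_timestamps(indices, avg_fps):
    """Convert frame indices to a comma-separated string of times in seconds."""
    return ",".join(f"{i / avg_fps:.2f}s" for i in indices)


# Example: a 10-second clip at 30 frames per second, sampled down to 4 frames.
indices = uniform_frame_indices(total_frames=300, num_samples=4)
stamps = format_timestamps(indices, avg_fps=30.0)
print(indices)
print(stamps)
```

The timestamp string produced this way has the same form as the `frame_time` value that the usage example inserts into the prompt.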

📄 License

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models (Qwen2 license). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

📚 Documentation

Citation

BibTeX:

@article{li2025temporal,
  title={Temporal Preference Optimization for Long-Form Video Understanding},
  author={Li, Rui and Wang, Xiaohan and Zhang, Yuhui and Wang, Zeyu and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2501.13919},
  year={2025}
}

References:

[1] Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., ... & Lu, Y. (2024). NVILA: Efficient Frontier Visual Language Models. arXiv preprint arXiv:2412.04468.

[2] Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., & Li, C. (2024). Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713.
