LongVA 7B TPO

Developed by ruili0

LongVA-7B-TPO is a video-text model derived from LongVA-7B through temporal preference optimization, excelling in long video understanding tasks.

Video-to-Text

Transformers

Open Source License: MIT

#Long Video Understanding #Temporal Preference Optimization #Multimodal Generation

Downloads 225

Release Time: 1/14/2025

Model Overview

This model focuses on long video understanding tasks, enhancing performance on long video benchmarks through temporal preference optimization techniques.

Model Features

Temporal Preference Optimization

Significantly improves long video understanding capabilities through temporal preference optimization techniques

High Performance

Establishes state-of-the-art performance across multiple benchmarks, with an average 2% improvement over the base model

Multimodal Processing

Capable of processing both image and video inputs while generating corresponding text descriptions

Model Capabilities

Long video content understanding

Video content description generation

Image content description generation

Multimodal reasoning

Use Cases

Accessibility Services

Video Assistance for Visually Impaired

Provides detailed video content descriptions for visually impaired individuals

Delivers accurate video content descriptions

Video Content Analysis

Long Video Content Understanding

Analyzes temporal information and content in long videos

Accurately comprehends complex content in long videos

license: mit
datasets:
- ruili0/LongVA-TPO-10k
base_model:
- lmms-lab/LongVA-7B
library_name: transformers
pipeline_tag: video-text-to-text

LongVA-7B-TPO

This repository contains the model described in the paper Temporal Preference Optimization for Long-form Video Understanding.

LongVA-7B-TPO, introduced in the paper Temporal Preference Optimization for Long-form Video Understanding, is obtained by applying temporal preference optimization to LongVA-7B. It establishes state-of-the-art performance across a range of benchmarks, with an average improvement of 2% over LongVA-7B.

Evaluation Results

Model	Size	LongVideoBench	MLVU	VideoMME (Short)	VideoMME (Medium)	VideoMME (Long)	VideoMME (Average)
LongVA-7B [1]	7B	51.3	58.8	61.3/61.6	50.4/53.6	46.2/47.6	52.6/54.3
LongVA-TPO (ours)	7B	54.2	61.7	63.1/66.6	54.8/55.3	47.4/47.9	55.1/56.6
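To make the size of the improvement concrete, the per-benchmark gains can be recomputed from the table. The sketch below is only arithmetic on the numbers reported above. The dictionary keys and the `score_deltas` helper are illustrative names, and the choice of which columns to average is an assumption, since this page does not say how the 2% figure was computed.

```python
# Scores copied from the evaluation table above.
# VideoMME cells hold two values; both are kept as separate entries.
baseline = {
    "LongVideoBench": 51.3,
    "MLVU": 58.8,
    "VideoMME avg (first value)": 52.6,
    "VideoMME avg (second value)": 54.3,
}
tpo = {
    "LongVideoBench": 54.2,
    "MLVU": 61.7,
    "VideoMME avg (first value)": 55.1,
    "VideoMME avg (second value)": 56.6,
}


def score_deltas(base, new):
    """Return the absolute gain in points per benchmark, rounded to one decimal."""
    return {name: round(new[name] - base[name], 1) for name in base}


deltas = score_deltas(baseline, tpo)
# Assumption: a plain mean over these four columns; the paper may average differently.
mean_gain = round(sum(deltas.values()) / len(deltas), 2)

for name, gain in deltas.items():
    print(f"{name}: +{gain}")
print(f"mean gain: +{mean_gain}")
```

On these four columns the gains range from 2.3 to 2.9 points, which is consistent with the roughly 2% average improvement stated above.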

Get Started

Use the code below to get started with the model. For more information, please refer to our GitHub repository.

from longva.model.builder import load_pretrained_model
from longva.mm_utils import tokenizer_image_token, process_images
from longva.constants import IMAGE_TOKEN_INDEX
from PIL import Image
from decord import VideoReader, cpu
import torch
import numpy as np
# fix seed
torch.manual_seed(0)

model_path = "ruili0/LongVA-TPO"
image_path = "local_demo/assets/lmms-eval.png"
video_path = "local_demo/assets/dc_demo.mp4"
max_frames_num = 16  # you can raise this to several thousand as long as your GPU memory can handle it :)
gen_kwargs = {"do_sample": True, "temperature": 0.5, "top_p": None, "num_beams": 1, "use_cache": True, "max_new_tokens": 1024}
# you can also set device_map to "auto" to accommodate more frames
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "llava_qwen", device_map="cuda:0")


#image input
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nDescribe the image in details.<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
image = Image.open(image_path).convert("RGB")
images_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=torch.float16)
with torch.inference_mode():
    output_ids = model.generate(input_ids, images=images_tensor, image_sizes=[image.size], modalities=["image"], **gen_kwargs)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
print("-"*50)

#video input
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nGive a detailed caption of the video as if I am blind.<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
vr = VideoReader(video_path, ctx=cpu(0))
total_frame_num = len(vr)
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
frames = vr.get_batch(frame_idx).asnumpy()
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16)
with torch.inference_mode():
    output_ids = model.generate(input_ids, images=[video_tensor], modalities=["video"], **gen_kwargs)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
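The snippet above repeats two steps for every input: building the chat-style prompt and picking evenly spaced frame indices. A minimal sketch of both as small helpers is shown below. `build_prompt` and `sample_frame_indices` are hypothetical names and not part of the `longva` package, and capping the frame count at the video length is an added assumption to avoid duplicate indices on very short clips.

```python
import numpy as np


def build_prompt(user_message, system_message="You are a helpful assistant."):
    """Build the chat-style prompt used above; <image> marks where visual tokens go."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n<image>\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def sample_frame_indices(total_frame_num, max_frames_num):
    """Return evenly spaced frame indices from the first to the last frame.

    Assumption: never request more frames than the video contains, so short
    clips do not yield repeated indices.
    """
    if total_frame_num <= 0 or max_frames_num <= 0:
        raise ValueError("frame counts must be positive")
    num = min(total_frame_num, max_frames_num)
    return np.linspace(0, total_frame_num - 1, num, dtype=int).tolist()


prompt = build_prompt("Give a detailed caption of the video as if I am blind.")
frame_idx = sample_frame_indices(total_frame_num=100, max_frames_num=4)
print(frame_idx)
```

In the video example, `frame_idx` would then be passed to `vr.get_batch(frame_idx)` exactly as before, with `total_frame_num=len(vr)`.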

License

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models (Qwen2 license). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

Citation

BibTeX:

@article{li2025temporal,
  title={Temporal Preference Optimization for Long-Form Video Understanding},
  author={Li, Rui and Wang, Xiaohan and Zhang, Yuhui and Wang, Zeyu and Yeung-Levy, Serena},
  journal={arXiv preprint arXiv:2501.13919},
  year={2025}
}

References:

[1]. Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., ... & Liu, Z. (2024). Long context transfer from language to vision. arXiv preprint arXiv:2406.16852.
