Llava Video 7B Qwen2
The LLaVA-Video model is a 7B-parameter multimodal model based on the Qwen2 language model, specializing in video understanding tasks and supporting 64-frame video input.
Downloads 34.28k
Release Time: 9/2/2024
Model Overview
This model is trained on the LLaVA-Video-178K and LLaVA-OneVision datasets and can interact with images, multiple images, and videos, with a primary focus on video understanding tasks.
Model Features
Multimodal Video Understanding
Processes video input to generate text descriptions or answer questions about the content
Long Context Support
Supports a context window of 32K tokens, capable of handling longer video content
Multi-Frame Processing Capability
Can process up to 64 frames of video input (see the sampling sketch below)
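As a rough illustration of the 64-frame budget, uniform frame sampling can be sketched as follows. This is not the official pipeline; it only uses the decord reader that the Generation example further down also relies on, and the helper name and video path are illustrative.

```python
# Illustrative sketch (not the official pipeline): uniformly sample at most
# 64 frames from a video, matching the model's 64-frame input budget.
import numpy as np
from decord import VideoReader, cpu

def sample_uniform_frames(video_path: str, max_frames: int = 64) -> np.ndarray:
    vr = VideoReader(video_path, ctx=cpu(0))
    num = min(max_frames, len(vr))
    idx = np.linspace(0, len(vr) - 1, num, dtype=int).tolist()
    return vr.get_batch(idx).asnumpy()  # shape: (num, H, W, 3), uint8
```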
Model Capabilities
Video Content Understanding
Video Q&A
Video Description Generation
Multimodal Reasoning
Use Cases
Video Understanding
Video Content Description
Generates detailed content descriptions based on input videos
Video Q&A
Answers various questions about video content
Performs strongly on multiple video Q&A benchmarks
Model Card Metadata
- Datasets: lmms-lab/LLaVA-OneVision-Data, lmms-lab/LLaVA-Video-178K
- Language: English
- Library: transformers
- License: apache-2.0
- Metrics: accuracy
- Tags: multimodal
- Pipeline tag: video-text-to-text
- Base model: lmms-lab/llava-onevision-qwen2-7b-si

Evaluation Results (model-index for LLaVA-Video-7B-Qwen2; task type: multimodal; all results verified)
- ActNet-QA: accuracy 56.5
- EgoSchema: accuracy 57.3
- MLVU: accuracy 70.8
- MVBench: accuracy 58.6
- NextQA: accuracy 83.2
- PercepTest: accuracy 67.9
- VideoChatGPT: score 3.52
- VideoDC: score 3.66
- LongVideoBench: accuracy 58.2
- VideoMME: accuracy 63.3
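These results are also encoded in the card's model-index metadata on Hugging Face. If you want to read them programmatically, here is a minimal sketch using huggingface_hub; the attribute names follow current huggingface_hub releases and are an assumption, not part of this card:

```python
# Minimal sketch: read the benchmark numbers from the model card's model-index.
# Assumes a recent huggingface_hub release; attribute names may differ across versions.
from huggingface_hub import ModelCard

card = ModelCard.load("lmms-lab/LLaVA-Video-7B-Qwen2")
for result in card.data.eval_results or []:
    print(f"{result.dataset_name}: {result.metric_name} = {result.metric_value}")
```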
LLaVA-Video-7B-Qwen2
Model Summary
The LLaVA-Video models are 7B/72B-parameter models trained on the LLaVA-Video-178K and LLaVA-OneVision datasets, built on the Qwen2 language model with a context window of 32K tokens.
This model supports at most 64 frames per video.
- Project Page: LLaVA-Video project page
- Paper: Video Instruction Tuning With Synthetic Data (arXiv:2410.02713)
- Repository: LLaVA-VL/LLaVA-NeXT
- Point of Contact: Yuanhan Zhang
- Languages: English, Chinese
Use
Intended use
The model was trained on the LLaVA-Video-178K and LLaVA-OneVision datasets and can interact with images, multiple images, and videos, with a particular focus on video.
Feel free to share your generations in the Community tab!
Generation
We provide a simple generation example below. For more details, please refer to the GitHub repository.
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
warnings.filterwarnings("ignore")
def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Decode a video and uniformly sample at most `max_frames_num` frames."""
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3)), "", 0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)  # frame stride for the requested sampling rate
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        # Fall back to uniform sampling of exactly `max_frames_num` frames.
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time

pretrained = "lmms-lab/LLaVA-Video-7B-Qwen2"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # Add any other thing you want to pass in llava_model_args
model.eval()

video_path = "XXXX"
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()  # match the model's bfloat16 weights
video = [video]

conv_template = "qwen_1_5"  # Make sure you use the correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\nPlease describe this video in detail."
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
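The same pipeline handles video Q&A: only the question text changes. A brief sketch that reuses the model, tokenizer, preprocessed video, and time_instruction from the example above; the question itself is just an illustration.

```python
# Follow-up question on the same video, reusing objects from the example above.
qa_question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\nHow many people appear in the video, and what are they doing?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], qa_question)
conv.append_message(conv.roles[1], None)
qa_input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
with torch.inference_mode():
    qa_out = model.generate(qa_input_ids, images=video, modalities=["video"], do_sample=False, max_new_tokens=512)
print(tokenizer.batch_decode(qa_out, skip_special_tokens=True)[0].strip())
```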
Training
Model
- Architecture: SO400M + Qwen2
- Initialized Model: lmms-lab/llava-onevision-qwen2-7b-si
- Data: a mixture of 1.6M single-image, multi-image, and video samples; 1 epoch; full-model training
- Precision: bfloat16 (see the inspection sketch below)
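As a quick sanity check on the architecture and precision listed above, the loaded model can be inspected. This is a sketch that assumes the load_pretrained_model call from the Generation section; the config attribute names follow the LLaVA-NeXT codebase and may differ across versions.

```python
# Inspect the loaded model (reuses `model` from the Generation example above).
print(model.config.architectures)        # Qwen2-based LLaVA causal LM class
print(model.config.mm_vision_tower)      # SigLIP SO400M vision tower (expected)
print(next(model.parameters()).dtype)    # torch.bfloat16
```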
Hardware & Software
- GPUs: 256 × NVIDIA Tesla A100 (for training the whole model series)
- Orchestration: Hugging Face Trainer
- Neural networks: PyTorch
Citation
@misc{zhang2024videoinstructiontuningsynthetic,
title={Video Instruction Tuning With Synthetic Data},
author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
year={2024},
eprint={2410.02713},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.02713},
}