Video LLaVA 7B
Video-LLaVA is a multimodal model that learns a united visual representation by aligning visual features to the language feature space before projection, enabling visual reasoning over both images and videos.
Downloads 2,066
Release Time: 11/17/2023
Model Overview
By binding unified visual representations to the language feature space, Video-LLaVA enables a large language model to perform visual reasoning on both images and videos, with strong cross-modal interaction capabilities.
Model Features
Pre-projection Alignment
Aligns image and video representations to the language feature space before projection, enabling unified processing of both modalities.
Cross-modal Interaction
Demonstrates strong cross-modal interaction even though the training data contains no paired image-video samples.
Modality Complementarity
Complementary learning between videos and images provides significant advantages over single-modality specialized models.
Model Capabilities
Image understanding and analysis
Video understanding and analysis
Multimodal reasoning
Visual question answering
Use Cases
Content Understanding
Video Content Analysis
Analyze video content and answer related questions
Capable of understanding actions, scenes, and events in videos
Image Content Understanding
Understand and describe image content
Capable of recognizing objects, scenes, and relationships in images
Education
Multimedia Teaching Assistance
Assist in understanding teaching videos and image content
Provides in-depth understanding of teaching materials
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA demonstrates remarkable interactive capabilities between images and videos, enabling an LLM to perform visual reasoning on both modalities simultaneously.
Quick Start
If you like our project, please give us a star on GitHub for the latest updates.
News
- [2024.01.27] Our MoE-LLaVA is released! A sparse model with 3B parameters outperforms the dense 7B model.
- [2024.01.17] Our LanguageBind has been accepted at ICLR 2024!
- [2024.01.16] We have reorganized the code and added support for LoRA fine-tuning; see finetune_lora.sh.
- [2023.11.30] Thanks to the generous contributions of the community, the OpenXLab demo is now accessible.
- [2023.11.23] We are training a new and more powerful model.
- [2023.11.21] Check out the Replicate demo, created by @nateraw, who has generously supported our research!
- [2023.11.20] The Hugging Face demo and all code & datasets are now available! Watch this repository for the latest updates.
⨠Features
đĄ Simple baseline, learning united visual representation by alignment before projection
- With the binding of unified visual representations to the language feature space, we enable an LLM to perform visual reasoning capabilities on both images and videos simultaneously.
đĨ High performance, complementary learning with video and image
- Extensive experiments demonstrate the complementarity of modalities, showcasing significant superiority when compared to models specifically designed for either images or videos.
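To make the "alignment before projection" idea concrete, here is a minimal, illustrative sketch: a language-aligned visual encoder (LanguageBind-style) produces features for both images and video frames, and a single shared projector maps them into the LLM's embedding space. The class name SharedVisualProjector and the layer sizes below are hypothetical and do not come from the Video-LLaVA codebase.

import torch
import torch.nn as nn

class SharedVisualProjector(nn.Module):
    """Hypothetical sketch: one projector shared by images and videos.

    Both modalities are first encoded by a language-aligned visual encoder
    (alignment), then mapped by the same MLP into the LLM token-embedding
    space (projection). Names and sizes are illustrative only.
    """
    def __init__(self, d_visual: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_visual, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_tokens, d_visual) for an image, or
        # (batch, num_frames * tokens_per_frame, d_visual) for a video.
        return self.proj(visual_feats)  # -> (batch, num_tokens, d_llm)

# Image tokens and video tokens pass through the same projector.
projector = SharedVisualProjector()
image_feats = torch.randn(1, 256, 1024)      # one image, 256 patch tokens
video_feats = torch.randn(1, 8 * 256, 1024)  # 8 frames x 256 tokens each
print(projector(image_feats).shape, projector(video_feats).shape)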
Demo
Gradio Web UI
We highly recommend trying our web demo with the following command, which incorporates all features currently supported by Video-LLaVA. We also provide an online demo on Hugging Face Spaces.
python -m videollava.serve.gradio_web_server
CLI Inference
python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/video.mp4" --load-4bit
python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/image.jpg" --load-4bit
Installation
- Python >= 3.10
- PyTorch == 2.0.1
- CUDA Version >= 11.7
- Install required packages:
git clone https://github.com/PKU-YuanGroup/Video-LLaVA
cd Video-LLaVA
conda create -n videollava python=3.10 -y
conda activate videollava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
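After installation, a quick sanity check like the one below (not part of the official repository) can confirm that the installed PyTorch and CUDA versions match the requirements listed above.

import torch

print("PyTorch version:", torch.__version__)         # expected: 2.0.1
print("CUDA available:", torch.cuda.is_available())  # expected: True on a GPU machine
if torch.cuda.is_available():
    print("CUDA runtime version:", torch.version.cuda)  # expected: >= 11.7
    print("GPU:", torch.cuda.get_device_name(0))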
Usage Examples
Basic Usage
Inference for image
import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria


def main():
    disable_torch_init()
    image = 'videollava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    cache_dir = 'cache_dir'
    device = 'cuda'
    load_4bit, load_8bit = True, False

    # Load the tokenizer, model, and visual processors (4-bit quantization enabled).
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device, cache_dir=cache_dir)
    image_processor = processor['image']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    # Preprocess the image into float16 tensors on the model's device.
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values']
    if type(image_tensor) is list:
        tensor = [image.to(model.device, dtype=torch.float16) for image in image_tensor]
    else:
        tensor = image_tensor.to(model.device, dtype=torch.float16)

    # Build the conversation prompt with a single image placeholder token.
    print(f"{roles[1]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    # Generate and decode only the newly generated tokens.
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])
    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)


if __name__ == '__main__':
    main()
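For deterministic, reproducible answers, sampling can be disabled in the script above. The following is a minimal variation of the generate call, assuming the standard Hugging Face generation arguments already used in the example:

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=tensor,
        do_sample=False,  # greedy decoding instead of temperature sampling
        max_new_tokens=1024,
        use_cache=True,
        stopping_criteria=[stopping_criteria])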
Inference for video
import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria


def main():
    disable_torch_init()
    video = 'videollava/serve/examples/sample_demo_1.mp4'
    inp = 'Why is this video funny?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    cache_dir = 'cache_dir'
    device = 'cuda'
    load_4bit, load_8bit = True, False

    # Load the tokenizer, model, and visual processors (4-bit quantization enabled).
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device, cache_dir=cache_dir)
    video_processor = processor['video']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    # Preprocess the video into float16 frame tensors on the model's device.
    video_tensor = video_processor(video, return_tensors='pt')['pixel_values']
    if type(video_tensor) is list:
        tensor = [video.to(model.device, dtype=torch.float16) for video in video_tensor]
    else:
        tensor = video_tensor.to(model.device, dtype=torch.float16)

    # Build the prompt with one image placeholder token per sampled frame.
    print(f"{roles[1]}: {inp}")
    inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    # Generate and decode only the newly generated tokens.
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.1,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])
    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)


if __name__ == '__main__':
    main()
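The two scripts above each load the model separately. To answer questions about both images and videos in one process, the model only needs to be loaded once. Below is a minimal, illustrative sketch assembled from the calls shown above; the helper function ask and the overall structure are our own and not part of the repository (the list-shaped tensor case from the examples is omitted for brevity).

import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

disable_torch_init()
model_path = 'LanguageBind/Video-LLaVA-7B'
model_name = get_model_name_from_path(model_path)
# Load once; the processor dict exposes both an image and a video preprocessor.
tokenizer, model, processor, _ = load_pretrained_model(
    model_path, None, model_name, False, True, device='cuda', cache_dir='cache_dir')
image_processor, video_processor = processor['image'], processor['video']

def ask(tensor, num_visual_tokens, question):
    # Hypothetical helper: build a prompt with the right number of visual
    # placeholder tokens, then generate and decode an answer.
    conv = conv_templates["llava_v1"].copy()
    inp = ' '.join([DEFAULT_IMAGE_TOKEN] * num_visual_tokens) + '\n' + question
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)
    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=tensor, do_sample=True, temperature=0.2,
                                    max_new_tokens=1024, use_cache=True, stopping_criteria=[stopping])
    return tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()

# Image: a single visual placeholder token in the prompt.
img = image_processor.preprocess('videollava/serve/examples/extreme_ironing.jpg',
                                 return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)
print(ask(img, 1, 'What is unusual about this image?'))

# Video: one placeholder per sampled frame, as in the video example above.
vid = video_processor('videollava/serve/examples/sample_demo_1.mp4',
                      return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)
print(ask(vid, model.get_video_tower().config.num_frames, 'Why is this video funny?'))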
Documentation
Training and validation instructions are in TRAIN_AND_VALIDATE.md.
Acknowledgement
- LLaVA: the codebase we build upon, an efficient large language and vision assistant.
- Video-ChatGPT: great work contributing the evaluation code and dataset.
Related Projects
- LanguageBind: an open-source, language-based retrieval framework spanning five modalities.
- Chat-UniVi: a framework that enables the model to efficiently utilize a limited number of visual tokens.
License
- The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
- The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.
Citation
If you find our paper and code useful in your research, please consider giving us a star and a citation.
@article{lin2023video,
  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},
  journal={arXiv preprint arXiv:2311.10122},
  year={2023}
}

@article{zhu2023languagebind,
  title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
  author={Zhu, Bin and Lin, Bin and Ning, Munan and Yan, Yang and Cui, Jiaxi and Wang, HongFa and Pang, Yatian and Jiang, Wenhao and Zhang, Junwu and Li, Zongwei and others},
  journal={arXiv preprint arXiv:2310.01852},
  year={2023}
}
⨠Star History
đ¤ Contributors