Video LLaVA 7B
Video-LLaVA is a multimodal model that learns a united visual representation by aligning visual features to the language feature space before projection, enabling visual reasoning over both images and videos.
Downloads 2,066
Release Time: 11/17/2023
Model Overview
By binding unified visual representations to the language feature space, Video-LLaVA enables a large language model to perform visual reasoning on both images and videos, with strong cross-modal interaction capabilities.
Model Features
Pre-projection Alignment
Aligns image and video representations to the language feature space before projection, enabling unified processing of both modalities.
Cross-modal Interaction
Demonstrates strong cross-modal interaction even though the training data contains no paired image-video samples.
Modality Complementarity
Complementary learning between videos and images provides significant advantages over single-modality specialized models.
Model Capabilities
Image understanding and analysis
Video understanding and analysis
Multimodal reasoning
Visual question answering
Use Cases
Content Understanding
Video Content Analysis
Analyze video content and answer related questions
Capable of understanding actions, scenes, and events in videos
Image Content Understanding
Understand and describe image content
Capable of recognizing objects, scenes, and relationships in images
Education
Multimedia Teaching Assistance
Assist in understanding teaching videos and image content
Provides in-depth understanding of teaching materials
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA demonstrates remarkable interactive capabilities between images and videos, enabling an LLM to perform visual reasoning on both modalities simultaneously.
Quick Start
If you like our project, please give us a star on GitHub for the latest updates.
News
- [2024.01.27] Our MoE-LLaVA is released! A sparse model with 3B parameters outperforms the dense 7B model.
- [2024.01.17] Our LanguageBind has been accepted at ICLR 2024!
- [2024.01.16] We have reorganized the code and added support for LoRA fine-tuning; see finetune_lora.sh.
- [2023.11.30] Thanks to the generous contributions of the community, the OpenXLab demo is now accessible.
- [2023.11.23] We are training a new and more powerful model.
- [2023.11.21] Check out the Replicate demo, created by @nateraw, who has generously supported our research!
- [2023.11.20] The Hugging Face demo and all code & datasets are now available! Watch this repository for the latest updates.
⨠Features
đĄ Simple baseline, learning united visual representation by alignment before projection
- With the binding of unified visual representations to the language feature space, we enable an LLM to perform visual reasoning capabilities on both images and videos simultaneously.
đĨ High performance, complementary learning with video and image
- Extensive experiments demonstrate the complementarity of modalities, showcasing significant superiority when compared to models specifically designed for either images or videos.
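To make the "alignment before projection" idea concrete, here is a minimal, illustrative sketch: a language-aligned visual encoder (LanguageBind-style) produces features for both images and video frames, and a single shared projector maps them into the LLM's embedding space. The class name SharedVisualProjector and the layer sizes below are hypothetical and do not come from the Video-LLaVA codebase.

import torch
import torch.nn as nn

class SharedVisualProjector(nn.Module):
    """Hypothetical sketch: one projector shared by images and videos.

    Both modalities are first encoded by a language-aligned visual encoder
    (alignment), then mapped by the same MLP into the LLM token-embedding
    space (projection). Names and sizes are illustrative only.
    """
    def __init__(self, d_visual: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_visual, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_tokens, d_visual) for an image, or
        # (batch, num_frames * tokens_per_frame, d_visual) for a video.
        return self.proj(visual_feats)  # -> (batch, num_tokens, d_llm)

# Image tokens and video tokens pass through the same projector.
projector = SharedVisualProjector()
image_feats = torch.randn(1, 256, 1024)      # one image, 256 patch tokens
video_feats = torch.randn(1, 8 * 256, 1024)  # 8 frames x 256 tokens each
print(projector(image_feats).shape, projector(video_feats).shape)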
Demo
Gradio Web UI
We highly recommend trying our web demo with the following command, which incorporates all features currently supported by Video-LLaVA. We also provide an online demo on Hugging Face Spaces.
python -m videollava.serve.gradio_web_server
CLI Inference
python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/video.mp4" --load-4bit
python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/image.jpg" --load-4bit
Installation
- Python >= 3.10
- PyTorch == 2.0.1
- CUDA Version >= 11.7
- Install required packages:
git clone https://github.com/PKU-YuanGroup/Video-LLaVA
cd Video-LLaVA
conda create -n videollava python=3.10 -y
conda activate videollava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
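After installation, a quick sanity check like the one below (not part of the official repository) can confirm that the installed PyTorch and CUDA versions match the requirements listed above.

import torch

print("PyTorch version:", torch.__version__)         # expected: 2.0.1
print("CUDA available:", torch.cuda.is_available())  # expected: True on a GPU machine
if torch.cuda.is_available():
    print("CUDA runtime version:", torch.version.cuda)  # expected: >= 11.7
    print("GPU:", torch.cuda.get_device_name(0))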
Usage Examples
Basic Usage
Inference for image
import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria


def main():
    disable_torch_init()
    image = 'videollava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    cache_dir = 'cache_dir'
    device = 'cuda'
    load_4bit, load_8bit = True, False

    # Load the tokenizer, model, and visual processors (4-bit quantization enabled).
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device, cache_dir=cache_dir)
    image_processor = processor['image']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    # Preprocess the image into float16 tensors on the model's device.
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values']
    if type(image_tensor) is list:
        tensor = [image.to(model.device, dtype=torch.float16) for image in image_tensor]
    else:
        tensor = image_tensor.to(model.device, dtype=torch.float16)

    # Build the conversation prompt with a single image placeholder token.
    print(f"{roles[1]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    # Generate and decode only the newly generated tokens.
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])
    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)


if __name__ == '__main__':
    main()
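For deterministic, reproducible answers, sampling can be disabled in the script above. The following is a minimal variation of the generate call, assuming the standard Hugging Face generation arguments already used in the example:

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=tensor,
        do_sample=False,  # greedy decoding instead of temperature sampling
        max_new_tokens=1024,
        use_cache=True,
        stopping_criteria=[stopping_criteria])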
Inference for video
import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria


def main():
    disable_torch_init()
    video = 'videollava/serve/examples/sample_demo_1.mp4'
    inp = 'Why is this video funny?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    cache_dir = 'cache_dir'
    device = 'cuda'
    load_4bit, load_8bit = True, False

    # Load the tokenizer, model, and visual processors (4-bit quantization enabled).
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device, cache_dir=cache_dir)
    video_processor = processor['video']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    # Preprocess the video into float16 frame tensors on the model's device.
    video_tensor = video_processor(video, return_tensors='pt')['pixel_values']
    if type(video_tensor) is list:
        tensor = [video.to(model.device, dtype=torch.float16) for video in video_tensor]
    else:
        tensor = video_tensor.to(model.device, dtype=torch.float16)

    # Build the prompt with one image placeholder token per sampled frame.
    print(f"{roles[1]}: {inp}")
    inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    # Generate and decode only the newly generated tokens.
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.1,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])
    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)


if __name__ == '__main__':
    main()
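The two scripts above each load the model separately. To answer questions about both images and videos in one process, the model only needs to be loaded once. Below is a minimal, illustrative sketch assembled from the calls shown above; the helper function ask and the overall structure are our own and not part of the repository (the list-shaped tensor case from the examples is omitted for brevity).

import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

disable_torch_init()
model_path = 'LanguageBind/Video-LLaVA-7B'
model_name = get_model_name_from_path(model_path)
# Load once; the processor dict exposes both an image and a video preprocessor.
tokenizer, model, processor, _ = load_pretrained_model(
    model_path, None, model_name, False, True, device='cuda', cache_dir='cache_dir')
image_processor, video_processor = processor['image'], processor['video']

def ask(tensor, num_visual_tokens, question):
    # Hypothetical helper: build a prompt with the right number of visual
    # placeholder tokens, then generate and decode an answer.
    conv = conv_templates["llava_v1"].copy()
    inp = ' '.join([DEFAULT_IMAGE_TOKEN] * num_visual_tokens) + '\n' + question
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)
    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=tensor, do_sample=True, temperature=0.2,
                                    max_new_tokens=1024, use_cache=True, stopping_criteria=[stopping])
    return tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()

# Image: a single visual placeholder token in the prompt.
img = image_processor.preprocess('videollava/serve/examples/extreme_ironing.jpg',
                                 return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)
print(ask(img, 1, 'What is unusual about this image?'))

# Video: one placeholder per sampled frame, as in the video example above.
vid = video_processor('videollava/serve/examples/sample_demo_1.mp4',
                      return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)
print(ask(vid, model.get_video_tower().config.num_frames, 'Why is this video funny?'))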
Documentation
Training and validation instructions are in TRAIN_AND_VALIDATE.md.
Acknowledgement
- LLaVA: the codebase we build upon, an efficient large language and vision assistant.
- Video-ChatGPT: great work contributing the evaluation code and dataset.
Related Projects
- LanguageBind: an open-source, language-based retrieval framework spanning five modalities.
- Chat-UniVi: a framework that enables the model to efficiently utilize a limited number of visual tokens.
License
- The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
- The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.
Citation
If you find our paper and code useful in your research, please consider giving us a star and a citation.
@article{lin2023video,
  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},
  journal={arXiv preprint arXiv:2311.10122},
  year={2023}
}

@article{zhu2023languagebind,
  title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
  author={Zhu, Bin and Lin, Bin and Ning, Munan and Yan, Yang and Cui, Jiaxi and Wang, HongFa and Pang, Yatian and Jiang, Wenhao and Zhang, Junwu and Li, Zongwei and others},
  journal={arXiv preprint arXiv:2310.01852},
  year={2023}
}
⨠Star History
đ¤ Contributors