Video-LLaVA - 7B Open-Source Multimodal Model - Deploy for Free and Handle Visual Reasoning Tasks on Images and Videos

Video LLaVA 7B

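Developed by LanguageBind

Video-LLaVA is a multimodal model that unifies visual representations through alignment before projection, which lets it handle visual reasoning tasks on both images and videos.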

Text-to-Video

Transformers

Open-source license: Apache-2.0 #UnifiedMultimodalRepresentation #JointVideoImageReasoning #AlignmentBeforeProjection

Downloads: 2,066

Release date: 11/17/2023

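Model Overview

By binding unified visual representations to the language feature space, Video-LLaVA enables a large language model to handle visual reasoning tasks on both images and videos, and it shows strong cross-modal interaction capabilities.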

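Model Features

Alignment Before Projection

Binds unified visual representations to the language feature space so that images and videos are processed in a unified way

Cross-Modal Interaction

Shows strong cross-modal interaction capabilities despite the absence of image-video pairs in the dataset

Modal Complementarity

Complementary learning from videos and images gives a significant advantage over models built for a single modality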

Model Capabilities

Image understanding and analysis

Video understanding and analysis

Multimodal reasoning

Visual question answering

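Use Cases

Content Understanding

Video content analysis

Analyzes video content and answers related questions

Can understand actions, scenes, and events in a video

Image content understanding

Understands and describes image content

Can recognize objects, scenes, and relationships in an image

Education

Multimedia teaching support

Helps with understanding educational videos and image content

Provides a deeper understanding of teaching materials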

license: apache-2.0

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

If you like our project, please give us a star ⭐ on GitHub to get the latest updates.

📰 News

[2024.01.27] 👀👀👀 Our MoE-LLaVA is released! A sparse model with 3B parameters outperformed the dense model with 7B parameters.
[2024.01.17] 🔥🔥🔥 Our LanguageBind has been accepted at ICLR 2024!
[2024.01.16] 🔥🔥🔥 We reorganized the code and added support for LoRA fine-tuning; see finetune_lora.sh.
[2023.11.30] 🤝 Thanks to the generous contributions of the community, the OpenXLab demo is now accessible.
[2023.11.23] We are training a new and powerful model.
[2023.11.21] 🤝 Check out the Replicate demo, created by @nateraw, who has generously supported our research!
[2023.11.20] 🤗 The Hugging Face demo and all code and datasets are available now! Feel free to watch 👀 this repository for the latest updates.

😮 Highlights

Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset.

💡 Simple baseline, learning united visual representation by alignment before projection

With the binding of unified visual representations to the language feature space, we enable an LLM to perform visual reasoning on both images and videos simultaneously.
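As a rough illustration of the idea above, the toy sketch below sends image tokens (one frame) and video tokens (several frames) through a single shared projection, assuming both encoders already emit features in one aligned visual space. Everything in it is hypothetical: the function names, the tiny dimensions, and the hand-written weights are assumptions chosen for readability and are not taken from the Video-LLaVA codebase.

```python
# Toy sketch: one shared projection for image and video tokens.
# All names, shapes, and weights are illustrative assumptions.

def project(tokens, weight):
    """Multiply each token vector (length d_in) by a d_in x d_out matrix."""
    d_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(d_out)]
        for tok in tokens
    ]

def flatten_frames(frames):
    """Concatenate per-frame token lists into one token sequence."""
    return [tok for frame in frames for tok in frame]

# Shared projection from a 2-dim visual space to a 3-dim "language" space.
SHARED_WEIGHT = [[1.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0]]

image = [[[1.0, 2.0], [3.0, 4.0]]]                    # 1 frame, 2 tokens
video = [[[1.0, 2.0]], [[0.5, 0.5]], [[2.0, 0.0]]]    # 3 frames, 1 token each

# The same weights are applied regardless of modality.
image_tokens = project(flatten_frames(image), SHARED_WEIGHT)
video_tokens = project(flatten_frames(video), SHARED_WEIGHT)

print(len(image_tokens), len(video_tokens))
print(image_tokens[0] == video_tokens[0])
```

Because the projection is shared, an identical visual feature maps to the same language-space vector whether it came from an image or from a video frame, which is the property the sketch is meant to show.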

🔥 High performance, complementary learning with video and image

Extensive experiments demonstrate the complementarity of modalities, showcasing significant superiority when compared to models specifically designed for either images or videos.

🤗 Demo

Gradio Web UI

We highly recommend trying our web demo with the following command, which incorporates all features currently supported by Video-LLaVA. We also provide an online demo in Hugging Face Spaces.

python -m videollava.serve.gradio_web_server

CLI Inference

python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/video.mp4" --load-4bit

python -m videollava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --file "path/to/your/image.jpg" --load-4bit
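When the CLI is driven from a script, it can help to assemble the arguments as a list instead of a shell string. The sketch below only builds that list from the flags shown above (`--model-path`, `--file`, `--load-4bit`); the helper name `build_cli_command` is hypothetical, and the snippet does not launch the model.

```python
import shlex
import sys

def build_cli_command(file_path, model_path="LanguageBind/Video-LLaVA-7B", load_4bit=True):
    """Return the argument list for the CLI invocation shown above."""
    cmd = [
        sys.executable, "-m", "videollava.serve.cli",
        "--model-path", model_path,
        "--file", file_path,
    ]
    if load_4bit:
        cmd.append("--load-4bit")
    return cmd

cmd = build_cli_command("path/to/your/video.mp4")
# Print the arguments after the interpreter path, shell-quoted for readability.
print(shlex.join(cmd[1:]))
```

With the package installed, the list can be passed to `subprocess.run(cmd, check=True)`; the same helper works for an image by changing `file_path`.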

🛠️ Requirements and Installation

Python >= 3.10
PyTorch == 2.0.1
CUDA Version >= 11.7
Install required packages:

git clone https://github.com/PKU-YuanGroup/Video-LLaVA
cd Video-LLaVA
conda create -n videollava python=3.10 -y
conda activate videollava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
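Before running inference, the version requirements listed above can be checked programmatically. The sketch below is a minimal, standard-library-only check; the helper names `parse_version` and `check_environment` are hypothetical, and the PyTorch and CUDA strings are passed in by hand here rather than detected.

```python
import sys

def parse_version(text):
    """Turn '2.0.1' or '2.0.1+cu117' into a tuple of ints such as (2, 0, 1)."""
    core = text.split("+", 1)[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())

def check_environment(python_version, torch_version, cuda_version):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if parse_version(python_version) < (3, 10):
        problems.append(f"Python {python_version} is older than 3.10")
    if parse_version(torch_version) != (2, 0, 1):
        problems.append(f"PyTorch {torch_version} is not 2.0.1")
    if parse_version(cuda_version) < (11, 7):
        problems.append(f"CUDA {cuda_version} is older than 11.7")
    return problems

current_python = ".".join(str(n) for n in sys.version_info[:3])
# The PyTorch and CUDA values below are example inputs, not detected values.
print(check_environment(current_python, "2.0.1+cu117", "11.7"))
```

In an environment where PyTorch is installed, `torch.__version__` and `torch.version.cuda` would be the natural inputs for the last two arguments.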

🤖 API

All of our code is open source. To load the model (e.g. LanguageBind/Video-LLaVA-7B) locally, you can use the following code snippets.

Inference for image

import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    image = 'videollava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    cache_dir = 'cache_dir'
    device = 'cuda'
    load_4bit, load_8bit = True, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device, cache_dir=cache_dir)
    image_processor = processor['image']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    # Preprocess the image; the result may be a single tensor or a list of tensors.
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values']
    if isinstance(image_tensor, list):
        tensor = [img.to(model.device, dtype=torch.float16) for img in image_tensor]
    else:
        tensor = image_tensor.to(model.device, dtype=torch.float16)

    # roles[0] is the user role; the question is the user's turn.
    print(f"{roles[0]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)

if __name__ == '__main__':
    main()
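The snippet above passes `stop_str` to the stopping criteria, but the decoded text can still end with that string. A minimal sketch for trimming it follows; the helper name `trim_stop_string`, the `"</s>"` stop string, and the sample text are assumptions for illustration, not real model output.

```python
def trim_stop_string(text, stop_str):
    """Strip surrounding whitespace and remove one trailing stop string, if present."""
    text = text.strip()
    if stop_str and text.endswith(stop_str):
        text = text[: -len(stop_str)].strip()
    return text

# Made-up decoded text standing in for tokenizer.decode(...) output.
decoded = "  An example answer. </s>"
print(trim_stop_string(decoded, "</s>"))
```

In the inference script this would be applied to `outputs` with the script's own `stop_str` before printing.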

Inference for video

import torch
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.model.builder import load_pretrained_model
from videollava.utils import disable_torch_init
from videollava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    video = 'videollava/serve/examples/sample_demo_1.mp4'
    inp = 'Why is this video funny?'
    model_path = 'LanguageBind/Video-LLaVA-7B'
    cache_dir = 'cache_dir'
    device = 'cuda'
    load_4bit, load_8bit = True, False
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device, cache_dir=cache_dir)
    video_processor = processor['video']
    conv_mode = "llava_v1"
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    # Preprocess the video; the result may be a single tensor or a list of tensors.
    video_tensor = video_processor(video, return_tensors='pt')['pixel_values']
    if isinstance(video_tensor, list):
        tensor = [clip.to(model.device, dtype=torch.float16) for clip in video_tensor]
    else:
        tensor = video_tensor.to(model.device, dtype=torch.float16)

    # roles[0] is the user role; the question is the user's turn.
    print(f"{roles[0]}: {inp}")
    inp = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.1,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)

if __name__ == '__main__':
    main()
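The two scripts differ mainly in how the prompt is prefixed: the image script prepends one image token, while the video script prepends one token per frame, joined by spaces. The sketch below reproduces that string construction in isolation. The helper name `build_visual_prompt` is hypothetical, `"<image>"` is an assumed value for `DEFAULT_IMAGE_TOKEN`, and the frame count of 8 is an assumption standing in for `model.get_video_tower().config.num_frames`.

```python
IMAGE_TOKEN = "<image>"  # assumed value of DEFAULT_IMAGE_TOKEN

def build_visual_prompt(question, num_frames=1, image_token=IMAGE_TOKEN):
    """Prefix the question with one placeholder token per frame, then a newline."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return " ".join([image_token] * num_frames) + "\n" + question

# One token for an image; one token per frame for a video.
image_prompt = build_visual_prompt("What is unusual about this image?")
video_prompt = build_visual_prompt("Why is this video funny?", num_frames=8)

print(repr(image_prompt))
print(repr(video_prompt))
```

With `num_frames=1` the result matches the image script's `DEFAULT_IMAGE_TOKEN + '\n' + inp` form, so a single helper covers both cases.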

🗝️ Training & Validating

The training and validation instructions are in TRAIN_AND_VALIDATE.md.

👍 Acknowledgement

LLaVA: the codebase we built upon, an efficient large language and vision assistant.
Video-ChatGPT: great work contributing the evaluation code and dataset.

🙌 Related Projects

LanguageBind: an open-source, language-based retrieval framework covering five modalities.
Chat-UniVi: a framework that enables the model to make efficient use of a limited number of visual tokens.

🔒 License

The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star ⭐ and a citation 📝.

@article{lin2023video,
  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},
  journal={arXiv preprint arXiv:2311.10122},
  year={2023}
}

@article{zhu2023languagebind,
  title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
  author={Zhu, Bin and Lin, Bin and Ning, Munan and Yan, Yang and Cui, Jiaxi and Wang, HongFa and Pang, Yatian and Jiang, Wenhao and Zhang, Junwu and Li, Zongwei and others},
  journal={arXiv preprint arXiv:2310.01852},
  year={2023}
}