Open-source VideoLLaMA2-72B Multimodal Model - A Visual Question-answering Dialogue Tool Supporting Video and Image Inputs

VideoLLaMA2-72B

Developed by DAMO-NLP-SG

VideoLLaMA 2 is a multimodal large language model focused on video understanding and spatio-temporal modeling, supporting video and image inputs, capable of performing visual question answering and dialogue tasks.

Visual Question Answering

Transformers

Language: English

Open Source License: Apache-2.0

Tags: #Multimodal Video Understanding #Spatio-Temporal Modeling Enhancement #Audio-Visual Fusion

Release Time: 8/13/2024

Model Overview

VideoLLaMA 2 is an advanced multimodal large language model specializing in video understanding and spatio-temporal modeling. It combines a visual encoder and a language decoder to process video and image inputs, performing tasks such as visual question answering and video description.

Model Features

Multimodal Understanding

Capable of processing both video and image inputs, understanding visual content, and engaging in natural language interactions.

Spatio-Temporal Modeling

Specially optimized for understanding and processing spatio-temporal information in videos.

Large-Scale Parameters

A powerful 72B-parameter language model providing deep semantic understanding and generation capabilities.

Instruction Following

Fine-tuned to accurately understand and execute various user instructions related to visual tasks.
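As a rough illustration of what the 72B-parameter language decoder implies for hardware, the sketch below estimates the memory needed to hold the decoder weights alone at a few numeric precisions. The helper name `weight_memory_gib` is hypothetical, and the figures ignore the visual encoder, activations, and the KV cache, so real requirements will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the memory needed to store the weights, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    decoder_params = 72e9  # 72B parameters, as stated in this model card

    # Bytes per parameter for common storage formats.
    for label, nbytes in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1)]:
        print(f"{label:>9}: {weight_memory_gib(decoder_params, nbytes):.1f} GiB")
```

In half precision the weights alone come to roughly 134 GiB, which is why a model of this size is normally sharded across several accelerators.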

Model Capabilities

Video Question Answering

Image Question Answering

Video Content Description

Image Content Description

Multimodal Dialogue

Spatio-Temporal Relationship Understanding

Use Cases

Video Understanding

Video Content Question Answering

Answering various questions about video content, such as identifying objects, analyzing actions, and understanding scenes.

Accurately identifies animals and their behaviors in videos and describes the overall atmosphere.

Video Summary Generation

Automatically generating textual descriptions and summaries of video content.

Image Understanding

Image Content Question Answering

Answering various questions about image content, such as identifying objects, analyzing scenes, and understanding emotions.

Accurately describes the clothing and behavior of people in images and analyzes the emotional atmosphere.

🚀 VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

VideoLLaMA 2 is a multimodal large language model that enhances spatial-temporal modeling and audio understanding in video-related tasks, offering solutions for visual question answering and other video-based applications.

🚀 Quick Start

The project provides model weights together with training, evaluation, and serving code. To get started, refer to the inference code example below.

✨ Features

Multimodal Capability: It is a multimodal large language model, supporting both video and image inference.
Rich Model Zoo: Offers a variety of model configurations with different language decoders and training frame numbers to meet diverse application needs.
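The "training frame numbers" above refer to how many frames are drawn from each video (8 or 16 in the model zoo). The source does not say how those frames are chosen, so the following is only a minimal sketch assuming uniform sampling over the clip. The function name `sample_frame_indices` is hypothetical and is not part of the VideoLLaMA 2 codebase.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` evenly spaced frame indices from a clip.

    Assumption: uniform sampling, taking the middle frame of each of
    `num_frames` equal-length segments. Short clips return every frame.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        return list(range(total_frames))

    segment = total_frames / num_frames  # segment length in frames (may be fractional)
    return [int(segment * i + segment / 2) for i in range(num_frames)]


if __name__ == "__main__":
    print(sample_frame_indices(80, 8))   # 8-frame configuration
    print(sample_frame_indices(100, 16)) # 16-frame configuration
```

Sampling more frames gives the model finer temporal coverage at the cost of more visual tokens per video.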

💻 Usage Examples

Basic Usage

import sys
sys.path.append('./')  # assumes the script is run from the VideoLLaMA2 repository root
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Load the model once and reuse it for both examples.
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-72B'
    model, processor, tokenizer = model_init(model_path)

    # Each example is (modality, input path, instruction).
    # Keeping them in a list stops the image settings from overwriting the video settings.
    examples = [
        # Video Inference
        ('video', 'assets/cat_and_chicken.mp4',
         'What animals are in the video, what are they doing, and how does the video feel?'),
        # Image Inference
        ('image', 'assets/sora.png',
         'What is the woman wearing, what is she doing, and how does the image feel?'),
    ]

    for modal, modal_path, instruct in examples:
        # processor[modal] preprocesses the file for the chosen modality.
        output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
        print(output)


if __name__ == "__main__":
    inference()
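The example above sets `modal` by hand to either 'video' or 'image'. When handling many files, one option is to derive that value from the file extension. The sketch below is an illustration only: `guess_modal` and both extension sets are hypothetical, and they say nothing about which formats the VideoLLaMA 2 processors actually accept.

```python
from pathlib import Path

# Illustrative extension sets (assumptions, not taken from the VideoLLaMA 2 code).
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv", ".webm"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def guess_modal(path: str) -> str:
    """Return 'video' or 'image' based on the file extension of `path`."""
    suffix = Path(path).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Cannot infer modality from extension: {suffix!r}")


if __name__ == "__main__":
    for p in ["assets/cat_and_chicken.mp4", "assets/sora.png"]:
        print(p, "->", guess_modal(p))
```

The returned string can then be used as the `modal` value, both as the key into `processor` and as the `modal` argument of `mm_infer`.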

📚 Documentation

📰 News

[2024.06.12] Released the model weights and the first version of the VideoLLaMA 2 technical report.
[2024.06.03] Released the training, evaluation, and serving code of VideoLLaMA 2.

🌎 Model Zoo

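Model Name	Type	Visual Encoder	Language Decoder	# Training Frames
VideoLLaMA2-7B-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	8
VideoLLaMA2-7B-16F-Base	Base	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-7B-16F	Chat	clip-vit-large-patch14-336	Mistral-7B-Instruct-v0.2	16
VideoLLaMA2-8x7B-Base	Base	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-8x7B	Chat	clip-vit-large-patch14-336	Mixtral-8x7B-Instruct-v0.1	8
VideoLLaMA2-72B-Base	Base	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8
VideoLLaMA2-72B	Chat	clip-vit-large-patch14-336	Qwen2-72B-Instruct	8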
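The model names in the zoo appear to follow a naming convention: a `-Base` suffix marks a base checkpoint (otherwise it is a chat checkpoint), and a `16F` segment marks 16 training frames (otherwise 8). The sketch below decodes a name under that assumption. The convention is inferred from the names listed here rather than documented, and `Variant` and `describe` are hypothetical helpers.

```python
from dataclasses import dataclass

MODEL_NAMES = [
    "VideoLLaMA2-7B-Base", "VideoLLaMA2-7B",
    "VideoLLaMA2-7B-16F-Base", "VideoLLaMA2-7B-16F",
    "VideoLLaMA2-8x7B-Base", "VideoLLaMA2-8x7B",
    "VideoLLaMA2-72B-Base", "VideoLLaMA2-72B",
]


@dataclass(frozen=True)
class Variant:
    name: str
    size: str             # e.g. "7B", "8x7B", "72B"
    kind: str             # "Base" or "Chat"
    training_frames: int  # 8 or 16


def describe(name: str) -> Variant:
    """Decode a model-zoo name using the assumed naming convention."""
    parts = name.split("-")  # e.g. ["VideoLLaMA2", "7B", "16F", "Base"]
    kind = "Base" if parts[-1] == "Base" else "Chat"
    frames = 16 if "16F" in parts else 8
    return Variant(name=name, size=parts[1], kind=kind, training_frames=frames)


if __name__ == "__main__":
    for model_name in MODEL_NAMES:
        print(describe(model_name))
```

For example, a script could filter `MODEL_NAMES` for chat checkpoints before choosing a `model_path` to pass to `model_init`.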

🚀 Main Results

Multi-Choice Video QA & Video Captioning

Open-Ended Video QA

📄 License

The project is licensed under the Apache-2.0 license.

Citation

If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:

@article{damonlpsg2024videollama2,
  title = {VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author = {Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal = {arXiv preprint arXiv:2406.07476},
  year = {2024},
  url = {https://arxiv.org/abs/2406.07476}
}
@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}

If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.