# InternVideo2-Chat-8B
This project further enriches the semantics of InternVideo2 and makes it easier to use in human-computer interaction. We integrate InternVideo2 into a VideoLLM together with an LLM and a video BLIP, following the progressive learning scheme of VideoChat: InternVideo2 serves as the video encoder, and a video BLIP is trained to communicate with open-sourced LLMs. The video encoder is updated during training. For detailed training recipes, please refer to VideoChat.
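The actual wiring lives in the model's `trust_remote_code` files on the Hub; purely as a mental model, here is a minimal sketch of the composition described above, assuming a BLIP-style query bridge. The class name, module arguments, and dimensions (`VideoLLMSketch`, `vis_dim=1408`, `llm_dim=4096`) are illustrative assumptions, not the real implementation:

```python
import torch
import torch.nn as nn

class VideoLLMSketch(nn.Module):
    """Illustrative composition only; names and dimensions are hypothetical."""

    def __init__(self, video_encoder, video_blip, llm, num_queries=32, vis_dim=1408, llm_dim=4096):
        super().__init__()
        self.video_encoder = video_encoder       # InternVideo2; kept trainable during chat tuning
        self.video_blip = video_blip             # BLIP-style bridge: cross-attends queries to visual tokens
        self.queries = nn.Parameter(torch.zeros(1, num_queries, vis_dim))
        self.proj = nn.Linear(vis_dim, llm_dim)  # map bridge outputs into the LLM embedding space
        self.llm = llm                           # e.g. Mistral-7B

    def forward(self, video, text_ids):
        vis_tokens = self.video_encoder(video)                           # (B, N, vis_dim)
        queries = self.queries.expand(vis_tokens.size(0), -1, -1)
        video_embeds = self.proj(self.video_blip(queries, vis_tokens))   # (B, num_queries, llm_dim)
        text_embeds = self.llm.get_input_embeddings()(text_ids)          # (B, L, llm_dim)
        return self.llm(inputs_embeds=torch.cat([video_embeds, text_embeds], dim=1))
```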
The base LLM of this model is Mistral-7B. Before using it, please make sure you have been granted access to Mistral-7B. If not, go to [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) to request access, and set your HF token in the HF_TOKEN environment variable.
[GitHub] [Tech Report] [Chat Demo]
## Quick Start
### Prerequisites
- You agree not to use the model to conduct experiments that cause harm to human subjects.
- Fill in the following information:
  - Name
  - Company/Organization
  - Country
  - E-Mail
- Obtain access to Mistral-7B from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) and set your HF token in the HF_TOKEN environment variable.
### Installation
1. Apply for permission to use this project and for access to the base LLM.
2. Set your HF user access token as an environment variable:
   ```shell
   export HF_TOKEN=hf_....
   ```
   If you don't know how to obtain a token starting with "hf_", please refer to: [How to Get a HF User Access Token](https://huggingface.co/docs/hub/security-tokens#user-access-tokens)
3. Make sure you have `transformers >= 4.39.0` and `peft == 0.5.0`, and install the requisite Python packages from pip_requirements:
   ```shell
   pip install transformers==4.39.1
   pip install peft==0.5.0
   pip install timm easydict einops
   ```
   You can verify the token and the pinned versions with the sketch below.
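A small optional sanity check, not part of the original instructions, that confirms the token is visible to Python and prints the installed versions of the pinned packages:

```python
import os
import importlib.metadata as metadata

# Report only whether the token is present, so the token itself is not leaked into logs.
token = os.environ.get("HF_TOKEN")
print("HF_TOKEN set:", token is not None and token.startswith("hf_"))

# transformers must be >= 4.39.0 and peft must be exactly 0.5.0 for the remote code to load.
for pkg in ("transformers", "peft", "timm", "easydict", "einops"):
    print(pkg, metadata.version(pkg))
```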
### Usage Examples
#### Basic Usage
```python
import os
import torch
from transformers import AutoTokenizer, AutoModel

# The gated checkpoint requires the HF token set earlier in the environment.
token = os.environ['HF_TOKEN']

tokenizer = AutoTokenizer.from_pretrained('OpenGVLab/InternVideo2-Chat-8B', trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).cuda()
```
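If your GPU does not support bfloat16, loading in float16 is a reasonable fallback; this is an assumption on my part rather than something the card states, using the standard `torch_dtype` argument of `from_pretrained`:

```python
# Assumed alternative for GPUs without bfloat16 support (not from the original card).
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    torch_dtype=torch.float16,
    trust_remote_code=True).cuda()
```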
```python
from decord import VideoReader, cpu
import decord
import numpy as np
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image
from torchvision import transforms
from torchvision.transforms import PILToTensor
from torchvision.transforms.functional import InterpolationMode

# Make decord return torch tensors instead of NumPy arrays.
decord.bridge.set_bridge("torch")


def get_index(num_frames, num_segments):
    # Sample one index near the center of each of `num_segments` equal windows.
    seg_size = float(num_frames - 1) / num_segments
    start = int(seg_size / 2)
    offsets = np.array([
        start + int(np.round(seg_size * idx)) for idx in range(num_segments)
    ])
    return offsets


def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    num_frames = len(vr)
    frame_indices = get_index(num_frames, num_segments)

    # ImageNet normalization statistics.
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)

    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(224),
        transforms.Normalize(mean, std)
    ])

    frames = vr.get_batch(frame_indices)
    frames = frames.permute(0, 3, 1, 2)  # (T, H, W, C) -> (T, C, H, W)
    frames = transform(frames)

    T_, C, H, W = frames.shape

    if return_msg:
        fps = float(vr.get_avg_fps())
        sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
        # Human-readable summary of which timestamps were sampled.
        msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
        return frames, msg
    else:
        return frames
```
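For intuition, `get_index` picks one index near the center of each of `num_segments` equal-length windows over the clip. A quick, purely illustrative check (not part of the original card):

```python
# Illustrative only: see which frame indices would be sampled from a 100-frame clip.
print(get_index(num_frames=100, num_segments=8))
# Expect 8 roughly evenly spaced indices between 0 and 99.
```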
```python
video_path = "yoga.mp4"
video_tensor = load_video(video_path, num_segments=8, return_msg=False)
video_tensor = video_tensor.to(model.device)

# Multi-turn chat: pass the running chat_history back in with each follow-up question.
chat_history = []
response, chat_history = model.chat(tokenizer, '', 'describe the action step by step.', media_type='video', media_tensor=video_tensor, chat_history=chat_history, return_history=True, generation_config={'do_sample': False})
print(response)

response, chat_history = model.chat(tokenizer, '', 'What is she wearing?', media_type='video', media_tensor=video_tensor, chat_history=chat_history, return_history=True, generation_config={'do_sample': False})
print(response)
```
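`load_video` can also return a short message describing which timestamps were sampled (`return_msg=True`). The card does not show how to use it; one reasonable option, sketched below, is to append it to the question so the model knows the sampling times:

```python
# Hedged variation, not from the original card: pass the frame-timing message along with the question.
video_tensor, msg = load_video(video_path, num_segments=8, return_msg=True)
video_tensor = video_tensor.to(model.device)

response, chat_history = model.chat(
    tokenizer, '', 'describe the action step by step. ' + msg,
    media_type='video', media_tensor=video_tensor,
    chat_history=[], return_history=True,
    generation_config={'do_sample': False})
print(response)
```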
## Performance
## Citation
If this work is helpful for your research, please consider citing InternVideo2 and VideoChat.
```bibtex
@article{wang2024internvideo2,
  title={Internvideo2: Scaling video foundation models for multimodal video understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}

@article{li2023videochat,
  title={Videochat: Chat-centric video understanding},
  author={Li, KunChang and He, Yinan and Wang, Yi and Li, Yizhuo and Wang, Wenhai and Luo, Ping and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}
```
## License
This project is under the MIT license.