# InternVideo2-Chat-8B
This project further enriches the semantics of InternVideo2 and makes it easier to use in human-computer interaction. We integrate InternVideo2 into a VideoLLM together with an LLM and a video BLIP, following the progressive learning scheme of VideoChat: InternVideo2 serves as the video encoder, and a video BLIP is trained to communicate with open-sourced LLMs. The video encoder is updated during training. For detailed training recipes, please refer to VideoChat.
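The actual wiring lives in the model's `trust_remote_code` files on the Hub; purely as a mental model, here is a minimal sketch of the composition described above, assuming a BLIP-style query bridge. The class name, module arguments, and dimensions (`VideoLLMSketch`, `vis_dim=1408`, `llm_dim=4096`) are illustrative assumptions, not the real implementation:

```python
import torch
import torch.nn as nn

class VideoLLMSketch(nn.Module):
    """Illustrative composition only; names and dimensions are hypothetical."""

    def __init__(self, video_encoder, video_blip, llm, num_queries=32, vis_dim=1408, llm_dim=4096):
        super().__init__()
        self.video_encoder = video_encoder       # InternVideo2; kept trainable during chat tuning
        self.video_blip = video_blip             # BLIP-style bridge: cross-attends queries to visual tokens
        self.queries = nn.Parameter(torch.zeros(1, num_queries, vis_dim))
        self.proj = nn.Linear(vis_dim, llm_dim)  # map bridge outputs into the LLM embedding space
        self.llm = llm                           # e.g. Mistral-7B

    def forward(self, video, text_ids):
        vis_tokens = self.video_encoder(video)                           # (B, N, vis_dim)
        queries = self.queries.expand(vis_tokens.size(0), -1, -1)
        video_embeds = self.proj(self.video_blip(queries, vis_tokens))   # (B, num_queries, llm_dim)
        text_embeds = self.llm.get_input_embeddings()(text_ids)          # (B, L, llm_dim)
        return self.llm(inputs_embeds=torch.cat([video_embeds, text_embeds], dim=1))
```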
The base LLM of this model is Mistral-7B. Before using it, please make sure you have been granted access to Mistral-7B. If not, go to [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) to request access, and set your HF token in the HF_TOKEN environment variable.
[GitHub] [Tech Report] [Chat Demo]
## Quick Start
### Prerequisites
- You agree not to use the model to conduct experiments that cause harm to human subjects.
- Fill in the following information:
  - Name
  - Company/Organization
  - Country
  - E-Mail
- Obtain access to Mistral-7B from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) and set your HF token in the HF_TOKEN environment variable.
### Installation
1. Apply for permission to use this project and for access to the base LLM.
2. Set your HF user access token as an environment variable:
   ```shell
   export HF_TOKEN=hf_....
   ```
   If you don't know how to obtain a token starting with "hf_", please refer to: [How to Get a HF User Access Token](https://huggingface.co/docs/hub/security-tokens#user-access-tokens)
3. Make sure you have `transformers >= 4.39.0` and `peft == 0.5.0`, and install the requisite Python packages from pip_requirements:
   ```shell
   pip install transformers==4.39.1
   pip install peft==0.5.0
   pip install timm easydict einops
   ```
   You can verify the token and the pinned versions with the sketch below.
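A small optional sanity check, not part of the original instructions, that confirms the token is visible to Python and prints the installed versions of the pinned packages:

```python
import os
import importlib.metadata as metadata

# Report only whether the token is present, so the token itself is not leaked into logs.
token = os.environ.get("HF_TOKEN")
print("HF_TOKEN set:", token is not None and token.startswith("hf_"))

# transformers must be >= 4.39.0 and peft must be exactly 0.5.0 for the remote code to load.
for pkg in ("transformers", "peft", "timm", "easydict", "einops"):
    print(pkg, metadata.version(pkg))
```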
### Usage Examples
#### Basic Usage
```python
import os
import torch
from transformers import AutoTokenizer, AutoModel

# The gated checkpoint requires the HF token set earlier in the environment.
token = os.environ['HF_TOKEN']

tokenizer = AutoTokenizer.from_pretrained('OpenGVLab/InternVideo2-Chat-8B', trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).cuda()
```
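If your GPU does not support bfloat16, loading in float16 is a reasonable fallback; this is an assumption on my part rather than something the card states, using the standard `torch_dtype` argument of `from_pretrained`:

```python
# Assumed alternative for GPUs without bfloat16 support (not from the original card).
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    torch_dtype=torch.float16,
    trust_remote_code=True).cuda()
```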
```python
from decord import VideoReader, cpu
import decord
import numpy as np
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image
from torchvision import transforms
from torchvision.transforms import PILToTensor
from torchvision.transforms.functional import InterpolationMode

# Make decord return torch tensors instead of NumPy arrays.
decord.bridge.set_bridge("torch")


def get_index(num_frames, num_segments):
    # Sample one index near the center of each of `num_segments` equal windows.
    seg_size = float(num_frames - 1) / num_segments
    start = int(seg_size / 2)
    offsets = np.array([
        start + int(np.round(seg_size * idx)) for idx in range(num_segments)
    ])
    return offsets


def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    num_frames = len(vr)
    frame_indices = get_index(num_frames, num_segments)

    # ImageNet normalization statistics.
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)

    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(224),
        transforms.Normalize(mean, std)
    ])

    frames = vr.get_batch(frame_indices)
    frames = frames.permute(0, 3, 1, 2)  # (T, H, W, C) -> (T, C, H, W)
    frames = transform(frames)

    T_, C, H, W = frames.shape

    if return_msg:
        fps = float(vr.get_avg_fps())
        sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
        # Human-readable summary of which timestamps were sampled.
        msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
        return frames, msg
    else:
        return frames
```
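For intuition, `get_index` picks one index near the center of each of `num_segments` equal-length windows over the clip. A quick, purely illustrative check (not part of the original card):

```python
# Illustrative only: see which frame indices would be sampled from a 100-frame clip.
print(get_index(num_frames=100, num_segments=8))
# Expect 8 roughly evenly spaced indices between 0 and 99.
```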
```python
video_path = "yoga.mp4"
video_tensor = load_video(video_path, num_segments=8, return_msg=False)
video_tensor = video_tensor.to(model.device)

# Multi-turn chat: pass the running chat_history back in with each follow-up question.
chat_history = []
response, chat_history = model.chat(tokenizer, '', 'describe the action step by step.', media_type='video', media_tensor=video_tensor, chat_history=chat_history, return_history=True, generation_config={'do_sample': False})
print(response)

response, chat_history = model.chat(tokenizer, '', 'What is she wearing?', media_type='video', media_tensor=video_tensor, chat_history=chat_history, return_history=True, generation_config={'do_sample': False})
print(response)
```
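`load_video` can also return a short message describing which timestamps were sampled (`return_msg=True`). The card does not show how to use it; one reasonable option, sketched below, is to append it to the question so the model knows the sampling times:

```python
# Hedged variation, not from the original card: pass the frame-timing message along with the question.
video_tensor, msg = load_video(video_path, num_segments=8, return_msg=True)
video_tensor = video_tensor.to(model.device)

response, chat_history = model.chat(
    tokenizer, '', 'describe the action step by step. ' + msg,
    media_type='video', media_tensor=video_tensor,
    chat_history=[], return_history=True,
    generation_config={'do_sample': False})
print(response)
```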
## Performance
## Citation
If this work is helpful for your research, please consider citing InternVideo2 and VideoChat.
```bibtex
@article{wang2024internvideo2,
  title={Internvideo2: Scaling video foundation models for multimodal video understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}

@article{li2023videochat,
  title={Videochat: Chat-centric video understanding},
  author={Li, KunChang and He, Yinan and Wang, Yi and Li, Yizhuo and Wang, Wenhai and Luo, Ping and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}
```
## License
This project is under the MIT license.