InternVideo2-Chat-8B: an open-source video understanding model with free support for video semantic understanding and human-computer interaction

InternVideo2-Chat-8B

Developed by OpenGVLab

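InternVideo2-Chat-8B is a video understanding model that combines a large language model (LLM) with a video BLIP module. It is built with a progressive learning scheme and supports video semantic understanding and human-computer interaction.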

Video-to-Text

Transformers

Language: English; License: MIT; Tags: #video-semantic-understanding #progressive-learning #multimodal-interaction

Downloads: 492

Released: 8/1/2024

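Model Overview

The model uses InternVideo2 as its video encoder and pairs it with a large language model such as Mistral-7B to build a VideoLLM. Fine-tuning this VideoLLM enriches the video semantics and makes human-computer interaction friendlier.

Model Features

Progressive learning scheme

Follows the progressive learning scheme from VideoChat: a video BLIP module is trained to communicate with an open-source LLM, and the video encoder is updated during training.

High-performance video understanding

Scores 60.3 on MVBench and 41.9 on VideoMME without subtitles (see the performance table), indicating that the model can understand video content and analyze its semantics.

Multimodal interaction

Accepts video and text input together and supports multimodal tasks such as video description and question answering.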

Model Capabilities

Video content understanding

Video question answering

Video description

Multimodal interaction

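Use Cases

Video analysis

Video description

Describes video content in detail, including action details and scene information.

Example output: The video shows a woman practicing yoga on a rooftop overlooking the mountains. She starts on her hands and knees, moves into a downward dog position, and finishes in a standing position.

Video question answering

Answers specific questions about a video, such as what a person is wearing or the details of an action.

Example output: The woman in the video is wearing a black tank top and grey yoga pants.

Human-computer interaction

Natural-language interaction

Lets users query the model in natural language to get detailed information about a video.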

🚀 InternVideo2-Chat-8B

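InternVideo2-Chat-8B is a video-text interaction model. By combining a video encoder with a large language model, it improves video semantic understanding and the friendliness of human-computer interaction, and it can handle a range of video question-answering tasks.

[📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]

🚀 Quick Start

To further enrich the semantics embedded in InternVideo2 and make it easier to use in human communication, we integrate InternVideo2 with a large language model (LLM) and a video BLIP into a VideoLLM and fine-tune it. We adopt the progressive learning scheme from VideoChat, using InternVideo2 as the video encoder and training a video BLIP to communicate with an open-source LLM. The video encoder is updated during training. For the detailed training recipe, see VideoChat.

The base LLM of this model is Mistral-7B. Before using the model, make sure you have been granted access to Mistral-7B. If you have not, request access on the Mistral-7B page, then add your Hugging Face token to the environment as HF_TOKEN.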
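A quick pre-flight check can catch a missing or malformed token before any download starts. The sketch below is illustrative only: `require_hf_token` is a hypothetical helper, not part of the model's code. It assumes the token is stored in `HF_TOKEN` and carries the `hf_` prefix of Hugging Face user access tokens.

```python
import os


def require_hf_token(env=None):
    """Return the Hugging Face token from the environment or raise a clear error.

    Hypothetical helper for illustration. Assumes the variable is named
    HF_TOKEN and that user access tokens start with "hf_".
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before loading the model.")
    if not token.startswith("hf_"):
        raise RuntimeError("HF_TOKEN does not look like a Hugging Face user access token.")
    return token
```

Calling `require_hf_token()` at the top of a script turns a late authorization failure into an immediate, readable error. It does not verify that the token has actually been granted access to the gated repositories.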

✨ Key Features

📈 Performance

Model	MVBench	VideoMME (without subtitles)
InternVideo2-Chat-8B	60.3	41.9
InternVideo2-Chat-8B-HD	65.4	46.1
InternVideo2-Chat-8B-HD-F16	67.5	49.4
InternVideo2-Chat-8B-InternLM	61.9	49.1

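📦 Installation

1. Request access to this project and to the base LLM.
2. Put your Hugging Face user access token in an environment variable:

export HF_TOKEN=hf_....

If you do not know how to obtain a token starting with "hf_", see: How to get an HF user access token.

3. Make sure transformers >= 4.39.0 and peft==0.5.0 are installed:

pip install transformers==4.39.1
pip install peft==0.5.0
pip install timm easydict einops

4. Install the remaining required Python packages from pip_requirements.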
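The version pins in the steps above can be verified from Python before loading the model. This is a minimal sketch using the standard library's `importlib.metadata`. The helper names (`parse_version`, `installed_version`, `check_environment`) are hypothetical, and the version parser is deliberately simplified: it reads only the leading numeric components and is not a full PEP 440 implementation.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.39.1' into (4, 39, 1). Simplified: stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Return a list of problems with the pins from the installation steps."""
    problems = []
    tf = installed_version("transformers")
    if tf is None or parse_version(tf) < (4, 39, 0):
        problems.append(f"transformers >= 4.39.0 required, found {tf}")
    peft = installed_version("peft")
    if peft is None or parse_version(peft) != (0, 5, 0):
        problems.append(f"peft == 0.5.0 required, found {peft}")
    return problems
```

An empty list from `check_environment()` means both pins are satisfied. Otherwise each entry names the package to reinstall.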

💻 Usage Example

Basic usage

import os

import decord
import numpy as np
import torch
from decord import VideoReader, cpu
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

# Fail early with a KeyError if HF_TOKEN is not set; huggingface_hub picks the token up from the environment.
token = os.environ['HF_TOKEN']

# Make decord return torch tensors instead of its own array type.
decord.bridge.set_bridge("torch")

tokenizer = AutoTokenizer.from_pretrained('OpenGVLab/InternVideo2-Chat-8B', trust_remote_code=True, use_fast=False)

model = AutoModel.from_pretrained(
    'OpenGVLab/InternVideo2-Chat-8B',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).cuda()

def get_index(num_frames, num_segments):
    # Split the video into num_segments equal spans and pick one frame near the middle of each.
    seg_size = float(num_frames - 1) / num_segments
    start = int(seg_size / 2)
    offsets = np.array([
        start + int(np.round(seg_size * idx)) for idx in range(num_segments)
    ])
    return offsets


def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
    # hd_num and padding are unused in this example.
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    num_frames = len(vr)
    frame_indices = get_index(num_frames, num_segments)

    # ImageNet channel statistics used for normalization.
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)

    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),  # uint8 [0, 255] -> float [0, 1]
        transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(resolution),
        transforms.Normalize(mean, std)
    ])

    frames = vr.get_batch(frame_indices)  # (T, H, W, C) with the torch bridge
    frames = frames.permute(0, 3, 1, 2)   # -> (T, C, H, W)
    frames = transform(frames)

    if return_msg:
        fps = float(vr.get_avg_fps())
        sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
        # " " should be added in the start and end
        msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
        return frames, msg
    else:
        return frames

video_path = "yoga.mp4"
# Uniformly sample 8 frames from the video.
video_tensor = load_video(video_path, num_segments=8, return_msg=False)
video_tensor = video_tensor.to(model.device)

chat_history = []
response, chat_history = model.chat(tokenizer, '', 'describe the action step by step.', media_type='video', media_tensor=video_tensor, chat_history=chat_history, return_history=True, generation_config={'do_sample': False})
print(response)
# The video shows a woman performing yoga on a rooftop with a beautiful view of the mountains in the background. She starts by standing on her hands and knees, then moves into a downward dog position, and finally ends with a standing position. Throughout the video, she maintains a steady and fluid movement, focusing on her breath and alignment. The video is a great example of how yoga can be practiced in different environments and how it can be a great way to connect with nature and find inner peace.

# Second turn: chat_history carries the first exchange, so "she" resolves to the woman in the video.
response, chat_history = model.chat(tokenizer, '', 'What is she wearing?', media_type='video', media_tensor=video_tensor, chat_history=chat_history, return_history=True, generation_config={'do_sample': False})
print(response)
# The woman in the video is wearing a black tank top and grey yoga pants.
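The frame sampling in `get_index` and the optional message from `load_video` can be checked by hand without a video file. The sketch below mirrors that logic in pure Python. `sample_indices` and `describe_sampling` are hypothetical names, and the 100-frame, 30 fps clip is an assumed example rather than a property of `yoga.mp4`.

```python
def sample_indices(num_frames, num_segments):
    """Pure-Python mirror of get_index: one frame near the middle of each segment."""
    seg_size = float(num_frames - 1) / num_segments
    start = int(seg_size / 2)
    # round() and np.round() both round exact halves to the nearest even integer.
    return [start + int(round(seg_size * idx)) for idx in range(num_segments)]


def describe_sampling(indices, fps):
    """Build the same message load_video returns when return_msg=True."""
    sec = ", ".join(str(round(i / fps, 1)) for i in indices)
    return f"The video contains {len(indices)} frames sampled at {sec} seconds."


# Hypothetical clip: 100 frames at 30 fps (values chosen for illustration).
indices = sample_indices(100, 8)
print(indices)  # [6, 18, 31, 43, 56, 68, 80, 93]
print(describe_sampling(indices, 30.0))
```

With 100 frames and 8 segments, each segment spans 12.375 frames and sampling starts at frame 6, so the chosen frames are spread evenly and never exceed the last valid index (99).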

✏️ Citation

If this work is helpful for your research, please consider citing InternVideo and VideoChat.

@article{wang2024internvideo2,
  title={Internvideo2: Scaling video foundation models for multimodal video understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}

@article{li2023videochat,
  title={Videochat: Chat-centric video understanding},
  author={Li, KunChang and He, Yinan and Wang, Yi and Li, Yizhuo and Wang, Wenhai and Luo, Ping and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}

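📄 License

This project is released under the MIT License.

⚠️ Important Notice

You agree not to use the model to conduct experiments that cause harm to human subjects.

Property	Details
Model type	Video-text interaction model
Training data	Not specified
License	MIT
Pipeline tag	video-text-to-text
Additional access prompt	You agree not to use the model to conduct experiments that cause harm to human subjects.
Additional access fields	Name, Company/Organization, Country, E-mail
Language	English
Tags	video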