LLaVA-NeXT-Video: an open-source multimodal chatbot with strong video understanding, free to deploy

Llava NeXT Video 7B Hf

Developed by llava-hf

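LLaVA-NeXT-Video is an open-source multimodal chatbot. Trained on a mixture of video and image data, it gains strong video understanding and is reported as state of the art among open-source models on the VideoMME benchmark at the time of writing.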

Video-Text-to-Text

Transformers

English #multimodal video understanding #zero-shot learning #instruction following

Downloads 65.95k

Release Time : 6/5/2024

Model Overview

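A video understanding model built on LLaVA-NeXT. It accepts images and videos as multimodal input and can perform tasks such as visual question answering and content description.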

Model Features

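Video understanding

Trained on 100K VideoChatGPT-Instruct samples, giving it strong video content understanding

Multimodal input support

Accepts both images and videos as input and can handle complex multimodal queries

Open-source SOTA

Reported as the best-performing open-source model on the VideoMME benchmark at the time of writing

Efficient inference

Supports 4-bit quantization and Flash-Attention 2 to reduce compute requirements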

Model Capabilities

Video content understanding

Image content analysis

Multimodal question answering

Video content description

Cross-modal reasoning

Use Cases

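Content understanding

Video content analysis

Analyze the scenes, actions, and events in a video

Describe the video content and what makes it interesting

Image question answering

Answer a wide range of questions about image content

Provide explanations of image content

Education

Instructional video understanding

Parse the content of educational videos to support learning

Help students understand complex concepts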

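🚀 LLaVA-NeXT-Video Model Card

LLaVA-NeXT-Video is an open-source chatbot model fine-tuned on multimodal data. It handles both video and image input and performs well on video understanding tasks. It is fine-tuned from lmsys/vicuna-7b-v1.5 and is reported as state of the art among open-source models on the VideoMME benchmark at the time of writing.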

Disclaimer: The team releasing LLaVA-NeXT-Video did not write a model card for this model, so this model card was written by the Hugging Face team.

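🚀 Quick Start

How to use the model

First, make sure transformers >= 4.42.0 is installed. The model supports multi-visual and multi-prompt generation, meaning you can pass multiple images or videos in a single prompt. Follow the correct prompt template (USER: xxx\nASSISTANT:) and add an <image> or <video> token wherever the prompt should refer to an image or a video.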
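To make the template concrete, here is a minimal sketch that assembles a prompt string by hand. `build_prompt` is a hypothetical helper, not part of transformers, and the exact whitespace around the placeholders is an assumption based on the template quoted above; `processor.apply_chat_template` remains the safer way to get a correctly formatted prompt.

```python
def build_prompt(question: str, num_images: int = 0, num_videos: int = 0) -> str:
    # Hypothetical helper: put one placeholder per visual input before the
    # question, then wrap everything in the USER/ASSISTANT template.
    placeholders = ["<image>"] * num_images + ["<video>"] * num_videos
    user_turn = "\n".join(placeholders + [question])
    # Assumption: the line break before ASSISTANT: follows the template as
    # quoted in this card; the processor's chat template may differ slightly.
    return f"USER: {user_turn}\nASSISTANT:"


prompt = build_prompt("Why is this video funny?", num_videos=1)
print(prompt)
```

The resulting string can be passed as `text=` to the processor in place of the output of `apply_chat_template`.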

Below is an example script that runs generation in float16 precision on a GPU device:

import av
import torch
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)

processor = LlavaNextVideoProcessor.from_pretrained(model_id)

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


# define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image", "video") 
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample uniformly 8 frames from the video, can sample more for longer videos
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

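From transformers>=v4.48 onward, you can also pass the URL or local path of an image or video in the conversation history and let the chat template handle the rest. For videos, you also need to specify num_frames, the number of frames to sample from the video; otherwise the whole video is loaded. The chat template loads the image or video for you and returns the inputs as torch.Tensor, which you can pass directly to model.generate().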

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "video", "path": "my_video.mp4"},
            {"type": "text", "text": "What is shown in this image and video?"},
        ],
    },
]

# The chat template loads the media, samples num_frames frames from the video, and tokenizes the prompt
inputs = processor.apply_chat_template(
    messages,
    num_frames=8,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
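When the list of inputs varies, the conversation can be assembled programmatically. The sketch below is one possible way to do that; `build_messages` is a hypothetical helper, and only the dictionary keys ("type", "url", "path", "text") come from the example above.

```python
from typing import Dict, List, Sequence


def build_messages(
    text: str,
    image_urls: Sequence[str] = (),
    video_paths: Sequence[str] = (),
) -> List[Dict]:
    # Hypothetical helper: one content entry per image URL and video path,
    # followed by the text question, all inside a single user turn.
    content: List[Dict] = [{"type": "image", "url": url} for url in image_urls]
    content += [{"type": "video", "path": path} for path in video_paths]
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]


messages = build_messages(
    "What is shown in this image and video?",
    image_urls=["https://www.ilankelman.org/stopsigns/australia.jpg"],
    video_paths=["my_video.mp4"],
)
print(messages)
```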

Inference with images as input

After loading the model as described above, use the following code to generate from an image:

import requests
from PIL import Image

conversation = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What are these?"},
          {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs_image = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs_image, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

Inference with images and videos as input

After loading the model, use the following code to generate from an image and a video together:

conversation_1 = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "What's the content of the image?"},
          {"type": "image"},
        ],
    }
]
conversation_2 = [
    {
      "role": "user",
      "content": [
          {"type": "text", "text": "Why is this video funny?"},
          {"type": "video"},
        ],
    },
]
prompt_1 = processor.apply_chat_template(conversation_1, add_generation_prompt=True)
prompt_2 = processor.apply_chat_template(conversation_2, add_generation_prompt=True)

# raw_image comes from the image example above, clip from the video example
inputs = processor(text=[prompt_1, prompt_2], images=raw_image, videos=clip, padding=True, return_tensors="pt").to(model.device)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=100)
out = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(out)
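The decoded strings contain the prompt as well as the model's reply. A small post-processing step can keep only the reply; the sketch below assumes the decoded text still contains the ASSISTANT: marker from the prompt template. `extract_answer` is a hypothetical helper and the sample string is made up for illustration.

```python
def extract_answer(decoded: str, marker: str = "ASSISTANT:") -> str:
    # Keep only the text after the last occurrence of the marker.
    # If the marker is missing, return the whole string unchanged (stripped).
    _, found, answer = decoded.rpartition(marker)
    return answer.strip() if found else decoded.strip()


# Made-up sample of what a decoded sequence might look like
sample = "USER: \nWhat are these? ASSISTANT: Two cats lying on a couch."
print(extract_answer(sample))
```

Applied to the batch above, this could be used as `[extract_answer(text) for text in out]`.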

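Model optimization

4-bit quantization with the `bitsandbytes` library

First, make sure bitsandbytes is installed (pip install bitsandbytes) and that you have access to a CUDA-capable GPU device. Then change the snippet above as follows: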

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)
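As a rough illustration of why 4-bit loading lowers resource needs, the sketch below estimates weight storage only. It assumes the nominal 7 billion parameters suggested by the model name and ignores the vision tower, activations, and quantization overhead, so real memory use will differ. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    # bits -> bytes -> gigabytes (1 GB taken as 1e9 bytes)
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 7e9  # assumption: taken from the "7B" in the model name

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB of weights")
```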

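Using Flash-Attention 2 to further speed up generation

First, make sure flash-attn is installed. For installation instructions, see the original Flash Attention repository. Then change the snippet above as follows: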

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   use_flash_attention_2=True
).to(0)
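If the same script has to run on machines with and without flash-attn, the choice can be made at runtime. The sketch below is one possible approach: it assumes a transformers release that accepts the `attn_implementation` keyword in `from_pretrained`, and `pick_attn_implementation` is a hypothetical helper.

```python
import importlib.util
from typing import Optional


def pick_attn_implementation() -> Optional[str]:
    # Request Flash-Attention 2 only when the flash_attn package is importable;
    # otherwise return None so that transformers falls back to its default.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return None


attn = pick_attn_implementation()
# Extra keyword arguments to merge into the from_pretrained(...) call above
load_kwargs = {"attn_implementation": attn} if attn else {}
print(load_kwargs)
```

The resulting `load_kwargs` can be unpacked into `LlavaNextVideoForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, **load_kwargs)`.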

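✨ Key Features

Multimodal processing: handles image and video data, and multiple images or videos can be passed in one prompt.
High performance: reported as state of the art among open-source models on the VideoMME benchmark at the time of writing.
Flexible input: from transformers>=v4.48 onward, image or video URLs and local paths can be passed directly.
Model optimization: supports 4-bit quantization and Flash-Attention 2 to speed up inference.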

📦 Installation

Make sure transformers >= 4.42.0 is installed. For model optimization you also need bitsandbytes and flash-attn.

pip install "transformers>=4.42.0"
pip install bitsandbytes  # for 4-bit quantization
# See https://github.com/Dao-AILab/flash-attention to install flash-attn
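To check which features the installed version supports, a small comparison against the two thresholds named in this card (4.42.0 for the model, 4.48 for URL and path inputs) could look like the sketch below. `version_tuple` and `meets_minimum` are hypothetical helpers that only compare the leading numeric components and ignore pre-release tags.

```python
import re
from importlib import metadata
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    # Keep the first three numeric components, padding with zeros: "4.42" -> (4, 42, 0)
    nums = [int(n) for n in re.findall(r"\d+", version)[:3]]
    return tuple(nums + [0] * (3 - len(nums)))


def meets_minimum(installed: str, minimum: str) -> bool:
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed")
else:
    print("model supported:", meets_minimum(installed, "4.42.0"))
    print("URL/path inputs in chat template:", meets_minimum(installed, "4.48.0"))
```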

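📚 Documentation

📄 Model Details

Attribute	Details
Model type	LLaVA-NeXT-Video is an open-source chatbot trained by fine-tuning a large language model (LLM) on multimodal instruction-following data. Building on LLaVA-NeXT, it is fine-tuned on a mixture of video and image data for better video understanding. Videos are uniformly sampled at 32 frames per clip. The model is reported as state of the art among open-source models on the VideoMME benchmark at the time of writing. Base LLM: lmsys/vicuna-7b-v1.5.
Model date	LLaVA-NeXT-Video-7B was trained in April 2024.
Paper or resources for more information	https://github.com/LLaVA-VL/LLaVA-NeXT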

[Figure: LLaVA-NeXT-Video architecture (llava_next_video_arch)]

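📚 Training Datasets

Images

558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following data.
500K academic-task-oriented VQA data mixture.
50K GPT-4V data mixture.
40K ShareGPT data.

Videos

100K VideoChatGPT-Instruct data.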

📊 Evaluation Datasets

A collection of 4 benchmarks, including 3 academic VQA benchmarks and 1 captioning benchmark.

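🔧 Technical Details

LLaVA-NeXT-Video is fine-tuned from lmsys/vicuna-7b-v1.5, and training on multimodal data improves its video understanding. During training, videos are uniformly sampled at 32 frames per clip. At inference time, the number of sampled frames can be set to handle videos of different lengths. The model also supports multi-visual and multi-prompt generation, so multiple images or videos can be passed in one prompt.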
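Uniform frame sampling can be sketched as follows. This is a variant of the `np.arange` expression in the quick-start script that uses integer arithmetic, so it always returns exactly the requested number of indices; `sample_frame_indices` is a hypothetical helper, and capping the count at the video length is an added assumption for short clips.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_frames: int) -> np.ndarray:
    # Evenly spaced frame indices in [0, total_frames), starting at frame 0.
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Assumption: never request more frames than the video contains
    num_frames = min(num_frames, total_frames)
    return (np.arange(num_frames) * total_frames) // num_frames


print(sample_frame_indices(80, 8))          # 8 frames from an 80-frame clip
print(len(sample_frame_indices(900, 32)))   # 32 frames, matching the training setup
```

The returned array can be passed as `indices` to `read_video_pyav(container, indices)`.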

📄 License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

✏️ Citation

If you use our paper and code in your research, please cite the following:

@misc{zhang2024llavanextvideo,
  title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
  url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
  author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
  month={April},
  year={2024}
}

@misc{liu2024llavanext,
    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
    url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
    month={January},
    year={2024}
}