videollm-online-8b-v1plus: An Open-Source Multimodal Model for Online Video Understanding and Content Generation

Videollm Online 8b V1plus

Developed by chenjoya

VideoLLM-online is a multimodal large language model built on Llama-3-8B-Instruct. It focuses on online video understanding and video-to-text generation.

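Task: video-to-text
Format: Safetensors
Language: English
License: MIT
Tags: real-time video understanding, multimodal LLM, long-video processing
Downloads: 1,688
Released: June 22, 2024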

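Model Overview

The model combines visual and language processing to handle video streams of up to 10 minutes in real time, analyzing frames at 2-10 frames per second. It targets online video understanding and interactive applications.

Model Features

Real-time video processing

Processes live video streams at 2-10 frames per second, for videos of up to 10 minutes.

Multimodal understanding

Pairs a SigLIP vision encoder with the Llama-3 language model to understand video content.

Efficient visual encoding

Represents each frame with one CLS token plus 3x3 average-pooled tokens (10 tokens per frame) at a resolution of 384.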
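
The frame token layout described above can be illustrated with a small sketch. It assumes a 24x24 patch grid, inferred from a 384-pixel input and the 16-pixel patches implied by the encoder name google/siglip-large-patch16-384, and it uses plain floats in place of real embedding vectors. The helper names are hypothetical and are not taken from the repository.

```python
# Sketch only: scalars stand in for embedding vectors, and the 24x24 grid is
# an assumption derived from 384 / 16. Helper names are hypothetical.
PATCH_GRID = 384 // 16  # 24 patches per side (assumed)
POOLED_GRID = 3         # 3x3 pooled tokens per frame


def average_pool_grid(grid, out_size):
    """Average-pool a square 2D grid of floats down to out_size x out_size."""
    n = len(grid)
    if n % out_size != 0:
        raise ValueError("grid side must be divisible by out_size")
    step = n // out_size
    pooled = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            # Collect one step x step block and replace it with its mean.
            block = [grid[i][j]
                     for i in range(r * step, (r + 1) * step)
                     for j in range(c * step, (c + 1) * step)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def frame_tokens(cls_value, patch_grid):
    """Return the per-frame token list: the CLS token first, then pooled tokens."""
    pooled = average_pool_grid(patch_grid, POOLED_GRID)
    return [cls_value] + [value for row in pooled for value in row]
```

Under these assumptions, a 24x24 grid collapses to 9 pooled tokens, so each frame contributes 10 tokens to the language model's input.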

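Large-scale training data

Trained on 134K Ego4D samples (113K from the narration stream plus 21K from the goal-step stream), covering a wide range of scenes.

Model Capabilities

Online video understanding

Video description generation

Multimodal reasoning

Real-time video interaction

Use Cases

Video analysis

Video content summarization

Automatically generates summaries of long videos.

Handles videos of up to 10 minutes.

Real-time video question answering

Answers questions about a video while it is playing.

Processes frames at 2-10 frames per second while responding.

Human-computer interaction

Video-assisted dialogue

A natural-language dialogue system grounded in video content.

Lets users discuss the content of a video in depth.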

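🚀 VideoLLM-online: Online Video Large Language Model

VideoLLM-online is an online video large language model for streaming video. It supports multimodal interaction and handles tasks such as video stream understanding and video-to-text generation.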

🚀 Quick Start

First, clone the GitHub repository and follow its installation instructions:

git clone https://github.com/showlab/videollm-online

Make sure Miniconda is installed with Python >= 3.10, then run the following commands:

conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate deepspeed peft editdistance Levenshtein tensorboard gradio moviepy submitit
pip install flash-attn --no-build-isolation

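The PyTorch channel installs ffmpeg, but the version is usually old and tends to produce very low-quality preprocessing. Install a recent static build of ffmpeg with the steps below. The extracted directory name includes the release version (7.0.1 in these commands), so adjust the mv command if the archive unpacks to a different name: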

wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xvf ffmpeg-release-amd64-static.tar.xz
rm ffmpeg-release-amd64-static.tar.xz
mv ffmpeg-7.0.1-amd64-static ffmpeg
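
To confirm which ffmpeg build ends up in place, a small check like the following may help. It is a sketch under assumptions: the default path ./ffmpeg/ffmpeg assumes the static archive keeps its binary at the top level of the renamed directory, and both function names are hypothetical.

```python
import re
import subprocess


def parse_ffmpeg_version(banner):
    """Extract the version token from the first line of `ffmpeg -version` output."""
    match = re.match(r"ffmpeg version (\S+)", banner)
    return match.group(1) if match else None


def installed_ffmpeg_version(binary="./ffmpeg/ffmpeg"):
    """Run the binary and return its version token, or None if it cannot be run.

    The default path is an assumption about where the static build was moved.
    """
    try:
        result = subprocess.run(
            [binary, "-version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_ffmpeg_version(result.stdout)
```

Calling installed_ffmpeg_version() from the repository root returns the version token reported by the binary, or None when the binary is missing or fails to run.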

If you want to try the model with audio in real-time streaming, also clone ChatTTS:

pip install omegaconf vocos vector_quantize_pytorch cython
git clone https://github.com/2noise/ChatTTS
mv ChatTTS demo/rendering/

Launch the Gradio demo locally:

python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus

Or launch the command-line interface locally:

python -m demo.cli --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus

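✨ Key Features

Multimodal support: combines a large language model (LLM) with a vision strategy to process video and text together.
Flexible frame rate: trained at 2 FPS, with inference at 2 to 10 FPS.
Long-video processing: handles videos of up to 10 minutes.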
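
The frame rate and video length figures translate into a visual token budget. The sketch below works through the arithmetic, assuming the frame token layout listed in the model details (one CLS token plus 3x3 pooled tokens, so 10 tokens per frame) and assuming every sampled frame is kept. The function name is hypothetical, and the source does not say whether the 10-minute length applies at every inference frame rate.

```python
TOKENS_PER_FRAME = 1 + 3 * 3  # one CLS token + 3x3 average-pooled tokens


def visual_token_count(duration_seconds, fps, tokens_per_frame=TOKENS_PER_FRAME):
    """Count visual tokens for a video sampled at a fixed frame rate."""
    if duration_seconds < 0 or fps <= 0:
        raise ValueError("duration must be non-negative and fps must be positive")
    num_frames = int(duration_seconds * fps)
    return num_frames * tokens_per_frame


# A 10-minute video at the training frame rate of 2 FPS.
print(visual_token_count(10 * 60, 2))   # 12000
# The same video at the upper inference frame rate of 10 FPS.
print(visual_token_count(10 * 60, 10))  # 60000
```

The fivefold gap between the two results shows why the per-frame token count is kept small: the context consumed by video grows linearly with both frame rate and duration.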

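📚 Documentation

Model Details

Property	Details
Large language model (LLM)	meta-llama/Meta-Llama-3-8B-Instruct
Vision strategy: frame encoder	google/siglip-large-patch16-384
Vision strategy: frame tokens	CLS token + 3x3 average-pooled tokens
Vision strategy: frame rate	2 FPS for training, 2~10 FPS for inference
Vision strategy: frame resolution	Max resolution 384, zero-padded to keep the aspect ratio
Vision strategy: video length	10 minutes
Training data	Ego4D Narration Stream 113K + Ego4D GoalStep Stream 21K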
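
The frame resolution row can be read as a resize-then-pad step. The sketch below computes the geometry only, under several assumptions that the table does not confirm: the longer side is scaled to exactly 384, the padded canvas is a 384x384 square, and the zero padding is split evenly on both sides. The function name is hypothetical.

```python
def fit_and_pad(width, height, target=384):
    """Compute the resized size and zero-padding needed to reach target x target.

    Assumptions (not confirmed by the source): the longer side is scaled to
    `target`, and padding is centered.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_w = target - new_w
    pad_h = target - new_h
    left, top = pad_w // 2, pad_h // 2
    # Padding is reported as (left, top, right, bottom) in pixels.
    return {"size": (new_w, new_h), "padding": (left, top, pad_w - left, pad_h - top)}
```

For a 1920x1080 frame this gives a 384x216 image with 84 rows of zeros above and below, so the aspect ratio of the content is unchanged.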

Model Sources

Repository: https://github.com/showlab/videollm-online
Paper: https://arxiv.org/abs/2406.11816

📄 License

This model is released under the MIT license.

📚 Citation

@inproceedings{videollm-online,
  author       = {Joya Chen and Zhaoyang Lv and Shiwei Wu and Kevin Qinghong Lin and Chenan Song and Difei Gao and Jia-Wei Liu and Ziteng Gao and Dongxing Mao and Mike Zheng Shou},
  title        = {VideoLLM-online: Online Video Large Language Model for Streaming Video},
  booktitle    = {CVPR},
  year         = {2024},
}