VideoLLM-online-8b-v1plus: an open-source multimodal model for online video understanding and content generation

Videollm Online 8b V1plus

Developed by chenjoya

VideoLLM-online is a multimodal large language model based on Llama-3-8B-Instruct, focusing on online video understanding and video-text generation tasks.

Video-to-Text

Safetensors

Language: English

Open Source License: MIT

Tags: #Real-time video understanding #Multimodal LLM #Long video processing

Downloads 1,688

Release Time: 6/22/2024

Model Overview

This model combines visual and language processing. It can process video streams up to 10 minutes long in real time, sampling 2 to 10 frames per second at inference, and is suited to online video understanding and interactive applications.
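The limits above translate into a rough visual token budget. The sketch below is only back-of-the-envelope arithmetic: it takes the per-frame token scheme listed in the model details (one CLS token plus 3x3 pooled tokens) and ignores text tokens and any separator tokens the model may add. The function name `visual_tokens` is hypothetical and not part of the repository.

```python
# Back-of-the-envelope visual token budget, using figures from the model card.
# Assumption: each frame contributes exactly 1 CLS token + 3x3 pooled tokens;
# text tokens and any per-frame separators are not counted here.
TOKENS_PER_FRAME = 1 + 3 * 3


def visual_tokens(duration_s: float, fps: float) -> int:
    """Return the number of visual tokens for a clip of the given length."""
    num_frames = int(duration_s * fps)
    return num_frames * TOKENS_PER_FRAME


if __name__ == "__main__":
    ten_minutes = 10 * 60  # maximum video length stated in the model card
    for fps in (2, 10):
        print(f"{fps} FPS -> {visual_tokens(ten_minutes, fps)} visual tokens")
```

Under these assumptions a 10-minute video yields 12,000 visual tokens at 2 FPS and 60,000 at 10 FPS, which is why the compact per-frame representation matters for long streams.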

Model Features

Real-time video processing

Supports real-time video stream processing at 2 - 10 frames per second and can handle video content up to 10 minutes long

Multimodal understanding

Combines a visual encoder (SigLIP) and a language model (Llama-3) to achieve in-depth understanding of video content

Efficient visual encoding

Represents each frame with one CLS token plus a 3x3 grid of average-pooled tokens (10 tokens per frame), keeping processing efficient at 384 resolution

Large-scale training data

Trained on 134K video samples from the Ego4D dataset (Narration Stream 113K + GoalStep Stream 21K), covering diverse scenarios
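To illustrate the frame token scheme, here is a minimal sketch of average pooling a patch grid down to 3x3. It assumes a patch-16 encoder at 384 resolution, which gives a 24x24 patch grid, and it pools non-overlapping 8x8 blocks. Real patch features are vectors; scalars are used here to keep the example short. The function `avg_pool_grid` is hypothetical, and the repository's actual pooling code may differ.

```python
def avg_pool_grid(grid, out_size=3):
    """Average-pool a square 2D grid of floats down to out_size x out_size."""
    n = len(grid)
    if n % out_size != 0 or any(len(row) != n for row in grid):
        raise ValueError("expected a square grid whose side is divisible by out_size")
    block = n // out_size  # side length of each pooled block
    pooled = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            # Collect every cell that falls inside this block, then average.
            cells = [
                grid[y][x]
                for y in range(by * block, (by + 1) * block)
                for x in range(bx * block, (bx + 1) * block)
            ]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    side = 384 // 16  # 24 patches per side for a patch-16 encoder at 384
    # Toy feature map: every patch stores its own row index.
    grid = [[float(y) for _ in range(side)] for y in range(side)]
    pooled = avg_pool_grid(grid)
    print(len(pooled), "x", len(pooled[0]), "pooled tokens")
    print(pooled)
```

The nine pooled values, together with one global (CLS-style) token, make up the 10 tokens that stand in for a frame.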

Model Capabilities

Online video understanding

Video content description generation

Multimodal reasoning

Real-time video interaction

Use Cases

Video analysis

Video content summary

Automatically generate content summaries for long videos

Can process videos up to 10 minutes long and generate summaries of their content

Real-time video Q&A

Conduct real-time Q&A on the currently playing video content

Supports real-time response at 2 - 10 frames per second

Human-computer interaction

Video-assisted dialogue

A natural language dialogue system based on video content

Can conduct in-depth exchanges with users about video content

🚀 Model Card for videollm-online-8b-v1plus

This model card provides details about a video-text-to-text model, which combines a large language model with vision strategies for online video understanding. It also includes information on installation, usage, and citation.

🚀 Quick Start

To get started with the model, follow these steps:

Clone the GitHub repository:

git clone https://github.com/showlab/videollm-online

Ensure you have Miniconda and Python version >= 3.10 installed, then run the following commands:

conda install -y pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers accelerate deepspeed peft editdistance Levenshtein tensorboard gradio moviepy submitit
pip install flash-attn --no-build-isolation

Install the newest ffmpeg:

wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xvf ffmpeg-release-amd64-static.tar.xz
rm ffmpeg-release-amd64-static.tar.xz
mv ffmpeg-*-amd64-static ffmpeg

If you want to try the model with audio in real-time streaming, also clone ChatTTS:

pip install omegaconf vocos vector_quantize_pytorch cython
git clone https://github.com/2noise/ChatTTS
mv ChatTTS demo/rendering/

Launch the gradio demo locally:

python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus

Or launch the CLI locally:

python -m demo.cli --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
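Before launching either demo, it can help to confirm that the prerequisites from the steps above are in place. The following is a small preflight sketch, not part of the repository. It assumes it is run from the repository root, that the static ffmpeg build was moved to `./ffmpeg` as shown above, and that ChatTTS (optional, audio only) lives in `demo/rendering/ChatTTS`. The function name `preflight` is hypothetical.

```python
import shutil
import sys
from pathlib import Path


def preflight(repo_root: str = ".") -> dict:
    """Report whether the Quick Start prerequisites appear to be in place."""
    root = Path(repo_root)
    return {
        # The Quick Start asks for Python >= 3.10.
        "python_ok": sys.version_info >= (3, 10),
        # Accept either an ffmpeg on PATH or the static build moved to ./ffmpeg.
        "ffmpeg_ok": shutil.which("ffmpeg") is not None
        or (root / "ffmpeg" / "ffmpeg").is_file(),
        # Only needed for the real-time audio demo.
        "chattts_present": (root / "demo" / "rendering" / "ChatTTS").is_dir(),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

A "no" for `chattts_present` is harmless unless you want audio output.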

✨ Features

Multimodal Capability: Combines a large language model (LLM) with vision strategies for video understanding.
Online Processing: Designed for online video streaming.
Flexible Frame Settings: Allows different frame FPS and resolutions.
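As an illustration of the flexible frame rate setting, the sketch below picks which source frames to keep when downsampling a video to a target FPS. It is a generic uniform-sampling example under stated assumptions, not the repository's method (which may rely on ffmpeg for resampling). The function name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(num_frames: int, source_fps: float, target_fps: float) -> list[int]:
    """Return indices of source frames to keep when resampling to target_fps."""
    if num_frames < 0 or source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame count must be non-negative and FPS values positive")
    # Sampling faster than the source would only duplicate frames, so clamp.
    target_fps = min(target_fps, source_fps)
    num_out = int(num_frames * target_fps / source_fps)
    step = source_fps / target_fps  # source frames per kept frame
    return [min(num_frames - 1, int(i * step)) for i in range(num_out)]


if __name__ == "__main__":
    # A 2-second clip recorded at 30 FPS, sampled at the training rate of 2 FPS.
    print(sample_frame_indices(60, source_fps=30, target_fps=2))
    # The same clip at the upper inference rate of 10 FPS.
    print(len(sample_frame_indices(60, source_fps=30, target_fps=10)), "frames kept")
```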

📦 Installation

The installation process involves cloning the repository, installing necessary Python packages, and getting the latest ffmpeg. If you want audio support, you also need to clone ChatTTS.

📚 Documentation

Model Details

Property	Details
LLM	meta-llama/Meta-Llama-3-8B-Instruct
Vision Strategy - Frame Encoder	google/siglip-large-patch16-384
Vision Strategy - Frame Tokens	CLS Token + Avg Pooled 3x3 Tokens
Vision Strategy - Frame FPS	2 for training, 2~10 for inference
Vision Strategy - Frame Resolution	max resolution 384, with zero-padding to keep aspect ratio
Vision Strategy - Video Length	10 minutes
Training Data	Ego4D Narration Stream 113K + Ego4D GoalStep Stream 21K
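The frame resolution row can be read as: scale the frame so its longer side is 384, then zero-pad the shorter side to a 384x384 square. The sketch below computes only the geometry, under two assumptions that the model card does not state: smaller frames are scaled up to 384, and the padding is centered. The function name `fit_and_pad` is hypothetical.

```python
def fit_and_pad(width: int, height: int, max_side: int = 384):
    """Return ((new_w, new_h), (pad_left, pad_top, pad_right, pad_bottom)).

    The frame is scaled so its longer side equals max_side (aspect ratio kept),
    then padded with zeros to a max_side x max_side square.
    Assumption: padding is centered; the real preprocessing may pad one side only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    new_w = max(1, round(width * max_side / longest))
    new_h = max(1, round(height * max_side / longest))
    pad_w = max_side - new_w
    pad_h = max_side - new_h
    left, top = pad_w // 2, pad_h // 2
    return (new_w, new_h), (left, top, pad_w - left, pad_h - top)


if __name__ == "__main__":
    # A 1920x1080 frame becomes 384x216 with 84 rows of zeros above and below.
    print(fit_and_pad(1920, 1080))
```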

Model Sources

Repository: https://github.com/showlab/videollm-online
Paper: https://arxiv.org/abs/2406.11816

📄 License

This project is licensed under the MIT license.

📄 Citation

@inproceedings{videollm-online,
  author       = {Joya Chen and Zhaoyang Lv and Shiwei Wu and Kevin Qinghong Lin and Chenan Song and Difei Gao and Jia-Wei Liu and Ziteng Gao and Dongxing Mao and Mike Zheng Shou},
  title        = {VideoLLM-online: Online Video Large Language Model for Streaming Video},
  booktitle    = {CVPR},
  year         = {2024},
}
