LLaVA-NeXT-Video 7B
LLaVA-NeXT-Video is an open-source multimodal chatbot, fine-tuned from a large language model, that supports interaction over video and text.
Downloads 1,146
Release Date: 4/16/2024
Model Overview
LLaVA-NeXT-Video is an open-source chatbot built on a large language model, focused on multimodal instruction-following tasks with support for video-and-text interaction.
Model Features
Multimodal Interaction
Accepts multimodal input combining video and text, and can understand video content and generate text responses about it.
Open-source Model
Fully open-source, so researchers and developers can use and modify it freely.
Instruction Following
Fine-tuned on multimodal instruction-following data, enabling it to carry out complex multimodal tasks accurately.
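Video input for models of this kind is typically prepared by uniformly sampling a fixed number of frames from the clip before they are encoded. The sketch below shows only that preprocessing step; the helper name and frame count are illustrative, not part of the model's API:

```python
import numpy as np

def sample_frame_indices(num_frames: int, total_frames: int) -> np.ndarray:
    """Return `num_frames` indices spread uniformly across a clip of
    `total_frames` frames (a common preprocessing step for
    video-language models)."""
    # linspace over [0, total_frames - 1] gives evenly spaced positions;
    # truncating to int64 yields valid integer frame indices.
    indices = np.linspace(0, total_frames - 1, num_frames)
    return indices.astype(np.int64)

# e.g. pick 8 frames from a 300-frame clip
print(sample_frame_indices(8, 300).tolist())
```

The sampled frames would then be decoded and passed to the model's processor alongside the text prompt.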
Model Capabilities
Video-Text Dialogue
Multimodal Instruction Understanding
Video Content Analysis
Text Generation
Use Cases
Research
Multimodal Model Research
Used in computer vision and natural language processing research to explore the potential of multimodal models.
Education
Video Content Q&A
Used in educational settings: students ask questions about a video, and the model generates relevant answers.
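A video Q&A turn like the one described above is usually assembled by pairing a video placeholder token with the user's question in a chat template. The exact template varies by release, so the `<video>` token and the USER/ASSISTANT role labels below are assumptions rather than the model's documented format:

```python
def build_prompt(question: str, video_token: str = "<video>") -> str:
    """Assemble a single-turn video-text prompt in a LLaVA-style
    USER/ASSISTANT template (format assumed; check the model card)."""
    # The placeholder token marks where the video frame features are
    # spliced into the token sequence by the processor.
    return f"USER: {video_token}\n{question} ASSISTANT:"

print(build_prompt("What happens in this clip?"))
```

The processor would replace the placeholder with the encoded frames, and the model's answer is generated after the trailing `ASSISTANT:` marker.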