LLaVA-Next-Inst-It-Vicuna-7B

Developed by Inst-IT
LLaVA-Next-Inst-It-Vicuna-7B is a multimodal model tuned for instance-level understanding of images and videos through explicit visual prompt instruction tuning.
Downloads 14
Release Time: 12/5/2024

Model Overview

This model combines the LLaVA-NeXT architecture with the Vicuna-7B language model. It focuses on multimodal instance-level understanding tasks and supports fine-grained analysis of images and videos.

Model Features

Multimodal instance-level understanding
Enhances fine-grained understanding of instances in images and videos through explicit visual prompt instruction tuning.
Supports Set-of-Marks visual prompts
Enables more precise instance referencing and analysis using Set-of-Marks visual prompts.
Video frame timestamp referencing
Supports referencing specific frames in videos via timestamps, enabling temporally aware multimodal understanding.
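
To make the two referencing mechanisms above more concrete, here is a minimal sketch of how inputs for such a model might be prepared. It is not the model's actual preprocessing pipeline. The `Mark` class, the helper names, and the prompt notation (`[ID]` for an instance, `<t>` for a sampled frame) are all assumptions made for illustration, and drawing the marks onto the image is left out.

```python
from dataclasses import dataclass


@dataclass
class Mark:
    """Hypothetical Set-of-Marks label: a numeric ID placed at an instance's center."""
    instance_id: int
    x: float
    y: float


def marks_from_boxes(boxes):
    """Assign IDs 1..N to boxes (x0, y0, x1, y1) and place each mark at the box center."""
    return [
        Mark(instance_id=i, x=(x0 + x1) / 2, y=(y0 + y1) / 2)
        for i, (x0, y0, x1, y1) in enumerate(boxes, start=1)
    ]


def sample_frame_indices(num_frames_total, num_samples):
    """Pick frame indices spread evenly from the first to the last frame."""
    if num_frames_total <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames_total)
    if num_samples == 1:
        return [0]
    step = (num_frames_total - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def build_question(instance_id, timestamp):
    """Build a question that refers to one instance at one sampled frame.

    The [ID] / <timestamp> notation is an assumed convention, not a confirmed format.
    """
    return f"What is [{instance_id}] doing at <{timestamp}>?"


if __name__ == "__main__":
    marks = marks_from_boxes([(0, 0, 10, 20), (30, 40, 50, 80)])
    frames = sample_frame_indices(100, 4)
    print(marks)
    print(frames)
    print(build_question(marks[1].instance_id, 3))
```

In this sketch the timestamp in the question is the position of a frame within the sampled sequence, not a time in seconds. Whether the model expects sampled-frame positions or another unit should be checked against the model's own documentation.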

Model Capabilities

Instance-level image description
Temporal video analysis
Multimodal Q&A
Fine-grained visual understanding
Open-ended text generation

Use Cases

Image understanding
Image instance description
Provides detailed descriptions of specific instances in images, which can be referenced by instance ID.
Achieves 68.6% accuracy on the Inst-IT-Bench-I-OE dataset.
Video understanding
Temporal video analysis
Analyzes how content changes at specific timestamps in a video.
Achieves 49.3% accuracy on the Inst-IT-Bench-V-OE dataset.
Multimodal Q&A
Image Q&A
Answers complex questions about image content, including instance-level details.
Achieves 65.9% accuracy on the GQA dataset.