Volcano-7b Open-source Multimodal Revision Model - Optimizing Content Adjustment with Mixed Data

Volcano 7b

Developed by kaist-ai

Volcano-7b is a multimodal self-feedback guided revision model, fine-tuned on the vicuna-7b-v1.5 model using a mixed visual instruction tuning dataset with multimodal feedback and revision data.

Image-to-Text

Transformers

English

#Multimodal Self-Feedback #Visual Question Answering Optimization #Iterative Revision

Downloads 268

Release Time: 11/13/2023

Model Overview

Volcano-7b is a large multimodal model that generates and then improves its responses through an iterative critique-revise-decide loop. It is suited to image-to-text and visual question answering tasks.

Model Features

Multimodal Self-Feedback Mechanism

Uses a critique-revise-decide loop in which the model evaluates its own output and revises it.

Large-Scale Multimodal Training

Trained on 274K multimodal feedback and revision examples in addition to several vision-language datasets.

Iterative Content Optimization

Can improve the quality of a generated response over multiple critique-and-revision iterations.

Model Capabilities

Image Caption Generation

Visual Question Answering

Multimodal Content Understanding

Self-Feedback and Revision

Multi-turn Dialogue

Use Cases

Education

Visual Learning Aid

Helps students understand complex diagrams and scientific images.

Provides accurate and easy-to-understand image descriptions.

Content Moderation

Image Content Analysis

Automatically identifies and describes sensitive content in images.

Improves efficiency and accuracy of content moderation.

Assistive Technology

Visual Impairment Assistance

Provides detailed image descriptions for visually impaired users.

Enhances accessibility experience.

🚀 Volcano - Multimodal Self-Feedback Guided Revision Model

Volcano is a multimodal model that uses a single LMM to generate responses, feedback, and revisions, following an iterative critique-revision-decide loop. It is useful for image-to-text tasks, visual question answering, and image captioning.

🚀 Quick Start

This section provides an overview of the Volcano model and its related information. For more details, please refer to the following sections.

✨ Features

Volcano employs a single LMM to generate initial responses, feedback, and revisions, as well as decisions to accept revisions.
It follows a sequential procedure of an iterative critique-revision-decide loop.

📚 Documentation

🔍 Links for Reference

Repository: https://github.com/kaistAI/Volcano
Paper: https://arxiv.org/abs/2311.07362

📋 Overview

Volcano employs a single LMM to generate initial responses, feedback, and revisions, as well as decisions to accept revisions. It follows a sequential procedure of an iterative critique-revision-decide loop.
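The loop described above can be sketched in a few lines of Python. This is only an illustration, not the code from the Volcano repository: the `RevisionModel` interface, its four method names, the `max_iterations` cap, and the rule of stopping at the first rejected revision are all assumptions made for the example. `ToyModel` is a stand-in with canned answers so the sketch runs without an actual LMM.

```python
from typing import Protocol, Sequence


class RevisionModel(Protocol):
    """Hypothetical interface: a single model plays all four roles."""

    def respond(self, image: object, question: str) -> str: ...
    def critique(self, image: object, question: str, response: str) -> str: ...
    def revise(self, image: object, question: str, response: str, feedback: str) -> str: ...
    def prefers_revision(self, question: str, response: str, revision: str) -> bool: ...


def critique_revise_decide(
    model: RevisionModel, image: object, question: str, max_iterations: int = 3
) -> str:
    """Run the loop; the iteration cap and stopping rule are assumptions."""
    response = model.respond(image, question)  # initial response
    for _ in range(max_iterations):
        feedback = model.critique(image, question, response)          # critique
        revision = model.revise(image, question, response, feedback)  # revise
        if not model.prefers_revision(question, response, revision):  # decide
            break  # revision rejected: keep the current response
        response = revision  # revision accepted: iterate on it
    return response


class ToyModel:
    """Stand-in with canned outputs so the loop can run without an LMM."""

    def __init__(self, drafts: Sequence[str]) -> None:
        self.drafts = list(drafts)  # successive answers, most refined last
        self.step = 0

    def respond(self, image: object, question: str) -> str:
        return self.drafts[0]

    def critique(self, image: object, question: str, response: str) -> str:
        return f"Check the claim: {response!r}"

    def revise(self, image: object, question: str, response: str, feedback: str) -> str:
        # Move to the next canned draft, staying on the last one once reached.
        self.step = min(self.step + 1, len(self.drafts) - 1)
        return self.drafts[self.step]

    def prefers_revision(self, question: str, response: str, revision: str) -> bool:
        # Toy decision rule: accept any revision that changes the text.
        return revision != response


if __name__ == "__main__":
    toy = ToyModel(["a cat", "a black cat", "a black cat on a sofa"])
    print(critique_revise_decide(toy, image=None, question="What is shown?"))
```

Keeping the four roles behind one interface mirrors the point that a single LMM produces the response, the feedback, the revision, and the accept decision; only the prompts would differ between the calls.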

📊 Model details

Property	Details
Model Type	Volcano-7b is a multimodal self-feedback guided revision model that was fine-tuned by mixing the visual instruction tuning dataset used in [LLaVA-v1.5](https://llava-vl.github.io/) with multimodal feedback and revision data collected through [gpt-3.5-turbo](https://platform.openai.com/docs/models/gpt-3-5), applied to the [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) model.
Model Date	Volcano-7b was trained in October 2023.

📈 Training dataset

274K multimodal feedback and revision data.
558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
158K GPT-generated multimodal instruction-following data.
450K academic-task-oriented VQA data mixture.
40K ShareGPT data.

The dataset used to train Volcano, which includes all of the datasets listed above, is available [here](https://huggingface.co/datasets/kaist-ai/volcano-train).
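To get a feel for how much of the mixture is feedback and revision data, the listed sizes can be summed directly. The snippet below is a rough back-of-the-envelope sketch: it assumes the five figures are counted in the same unit and that every example is used once, which this page does not state. The dictionary keys are shortened labels chosen for the example.

```python
# Sizes in thousands of examples, copied from the training dataset list.
MIXTURE_K = {
    "multimodal feedback and revision": 274,
    "LAION/CC/SBU image-text pairs (BLIP captions)": 558,
    "GPT-generated instruction-following": 158,
    "academic-task-oriented VQA": 450,
    "ShareGPT": 40,
}


def mixture_shares(sizes: dict[str, int]) -> tuple[int, dict[str, float]]:
    """Return the total size and each subset's fraction of that total."""
    total = sum(sizes.values())
    return total, {name: size / total for name, size in sizes.items()}


total_k, shares = mixture_shares(MIXTURE_K)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: {total_k}K")
```

Under those assumptions the subsets add up to 1,480K examples, of which the feedback and revision data make up roughly 18.5%.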

📉 Evaluation dataset

A collection of three multimodal hallucination benchmarks ([MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench), POPE, [GAVIE](https://github.com/FuxiaoLiu/LRV-Instruction)) and two multimodal understanding benchmarks ([MM-Vet](https://github.com/yuweihao/MM-Vet), [MMBench](https://github.com/open-compass/MMBench)).
