Llama-3.2V-11B-cot Open-Source Vision-Language Model - Supports Spontaneous Systematic Reasoning

Llama 3.2V 11B Cot

Developed by Xkev

Llama-3.2V-11B-cot is a visual-language model capable of spontaneous and systematic reasoning, developed based on the LLaVA-CoT framework.

Image-to-Text

Transformers

EnglishOpen Source License:Apache-2.0 #Visual-Language Reasoning #Multimodal Chain-of-Thought #Systematic Reasoning

Downloads 5,089

Release Time : 11/19/2024

Model Overview

This model is the first version of LLaVA-CoT, focusing on step-by-step reasoning in visual-language tasks, supporting image-to-text conversion and understanding.

Model Features

Step-by-Step Reasoning

Supports systematic, step-by-step visual-language reasoning, capable of handling complex multimodal tasks.

High-Performance Benchmarking

Performs excellently in multiple visual-language benchmarks, with an average score of 63.5.

Long-Text Generation

Supports generating up to 2048 new tokens, suitable for tasks requiring long-text output.

Model Capabilities

Image Understanding

Text Generation

Multimodal Reasoning

Visual Question Answering

Use Cases

Education

Visual Math Problem Solving

Solving math problems containing diagrams and formulas

Achieved a score of 54.8 on the MathVista benchmark

General AI Assistant

Multimodal Dialogue

Intelligent dialogue based on image and text input

Achieved a score of 75.0 on the MMBench benchmark

🚀 Llama-3.2V-11B-cot Model

Llama-3.2V-11B-cot is the first version of LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning.

🚀 Quick Start

You can use the inference code for Llama-3.2-11B-Vision-Instruct.

✨ Features

Llama-3.2V-11B-cot is a visual language model that can perform spontaneous and systematic reasoning. It is the first version of LLaVA-CoT.

📦 Installation

No installation steps are provided in the original document, so this section is skipped.

💻 Usage Examples

No code examples are provided in the original document, so this section is skipped.

📚 Documentation

Model Details

License: apache-2.0
Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct

Benchmark Results

MMStar	MMBench	MMVet	MathVista	AI2D	Hallusion	Average
57.6	75.0	60.3	54.8	85.7	47.8	63.5

Reproduction

To reproduce our results, you should use VLMEvalKit and the following settings.

Parameter	Value
do_sample	True
temperature	0.6
top_p	0.9
max_new_tokens	2048

You may change them in this file, line 80 - 83, and modify the max_new_tokens throughout the file.

⚠️ Important Note

We follow the same settings as Llama-3.2-11B-Vision-Instruct, except that we extend the max_new_tokens to 2048. After you get the results, you should filter the model output and only keep the outputs between <CONCLUSION> and </CONCLUSION>. This shouldn't have any difference in theory, but empirically we observe some performance difference because the jugder GPT-4o can be inaccurate sometimes. By keeping the outputs between <CONCLUSION> and </CONCLUSION>, most answers can be direclty extracted using VLMEvalKit system, which can be much less biased.

Training Details

Training Data

The model is trained on the LLaVA-CoT-100k dataset.

Training Procedure

The model is finetuned on llama-recipes with the following settings. Using the same setting should accurately reproduce our results.

Parameter	Value
FSDP	enabled
lr	1e-5
num_epochs	3
batch_size_training	4
use_fast_kernels	True
run_validation	False
batching_strategy	padding
context_length	4096
gradient_accumulation_steps	1
gradient_clipping	False
gradient_clipping_threshold	1.0
weight_decay	0.0
gamma	0.85
seed	42
use_fp16	False
mixed_precision	True

Bias, Risks, and Limitations

The model may generate biased or offensive content, similar to other VLMs, due to limitations in the training data. Technically, the model's performance in aspects like instruction following still falls short of leading industry models.

🔧 Technical Details

No specific technical implementation details (more than 50 words) are provided in the original document, so this section is skipped.

📄 License

The model is licensed under the apache-2.0 license.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご