đ Llama-3.2V-11B-cot Model
Llama-3.2V-11B-cot is the first version of LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning.
đ Quick Start
You can use the inference code for Llama-3.2-11B-Vision-Instruct.
⨠Features
Llama-3.2V-11B-cot is a visual language model that can perform spontaneous and systematic reasoning. It is the first version of LLaVA-CoT.
đĻ Installation
No installation steps are provided in the original document, so this section is skipped.
đģ Usage Examples
No code examples are provided in the original document, so this section is skipped.
đ Documentation
Model Details
- License: apache-2.0
- Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct
Benchmark Results
MMStar |
MMBench |
MMVet |
MathVista |
AI2D |
Hallusion |
Average |
57.6 |
75.0 |
60.3 |
54.8 |
85.7 |
47.8 |
63.5 |
Reproduction
To reproduce our results, you should use VLMEvalKit and the following settings.
Parameter |
Value |
do_sample |
True |
temperature |
0.6 |
top_p |
0.9 |
max_new_tokens |
2048 |
You may change them in this file, line 80 - 83, and modify the max_new_tokens throughout the file.
â ī¸ Important Note
We follow the same settings as Llama-3.2-11B-Vision-Instruct, except that we extend the max_new_tokens to 2048. After you get the results, you should filter the model output and only keep the outputs between <CONCLUSION> and </CONCLUSION>. This shouldn't have any difference in theory, but empirically we observe some performance difference because the jugder GPT-4o can be inaccurate sometimes. By keeping the outputs between <CONCLUSION> and </CONCLUSION>, most answers can be direclty extracted using VLMEvalKit system, which can be much less biased.
Training Details
Training Data
The model is trained on the LLaVA-CoT-100k dataset.
Training Procedure
The model is finetuned on llama-recipes with the following settings. Using the same setting should accurately reproduce our results.
Parameter |
Value |
FSDP |
enabled |
lr |
1e-5 |
num_epochs |
3 |
batch_size_training |
4 |
use_fast_kernels |
True |
run_validation |
False |
batching_strategy |
padding |
context_length |
4096 |
gradient_accumulation_steps |
1 |
gradient_clipping |
False |
gradient_clipping_threshold |
1.0 |
weight_decay |
0.0 |
gamma |
0.85 |
seed |
42 |
use_fp16 |
False |
mixed_precision |
True |
Bias, Risks, and Limitations
The model may generate biased or offensive content, similar to other VLMs, due to limitations in the training data. Technically, the model's performance in aspects like instruction following still falls short of leading industry models.
đ§ Technical Details
No specific technical implementation details (more than 50 words) are provided in the original document, so this section is skipped.
đ License
The model is licensed under the apache-2.0 license.